k-Means has Polynomial Smoothed Complexity

The k-means method is one of the most widely used clustering algorithms, drawing
its popularity from its speed in practice. Recently, however, it was shown to have
exponential worst-case running time. In order to close the gap between practical
performance and theoretical analysis, the k-means method has been studied in the
model of smoothed analysis. But even the smoothed analyses so far are unsatisfactory
as the bounds are still super-polynomial in the number $n$ of data points.

In this paper, we settle the smoothed running time of the k-means method. We
show that the smoothed number of iterations is bounded by a polynomial in $n$ and
$1/\sigma$, where $\sigma$ is the standard deviation of the Gaussian perturbations. This means
that if an arbitrary input data set is randomly perturbed, then the k-means method
will run in expected polynomial time on that input set.

1 Introduction

Clustering is a fundamental problem in computer science with applications ranging from biology
to information retrieval and data compression. In a clustering problem, a set of objects, usually
represented as points in a high-dimensional space $\mathbb{R}^d$, is to be partitioned such that objects
in the same group share similar properties. The k-means method is a traditional clustering
algorithm, which is based on ideas by Lloyd [20]. It begins with an arbitrary clustering based on
$k$ centers in $\mathbb{R}^d$, and then repeatedly makes local improvements until the clustering stabilizes.
The algorithm is greedy and as such, it offers virtually no accuracy guarantees. However, it is
both very simple and very fast, which makes it appealing in practice. Indeed, one recent survey
of data mining techniques states that the k-means method "is by far the most popular clustering
algorithm used in scientific and industrial applications" [10].

However, theoretical analysis has long been in stark contrast with what is observed in practice.
In particular, it was recently shown that the worst-case running time of the k-means method is
$2^{\Omega(n)}$ even on two-dimensional instances [25]. Conversely, the only upper bounds known for the
general case are $k^n$ and $n^{O(kd)}$. Both upper bounds are based entirely on the trivial fact that the
k-means method never encounters the same clustering twice [16]. In contrast, Duda et al. state
that the number of iterations until the clustering stabilizes is often linear or even sublinear in $n$
on practical data sets [11, Section 10.4.3]. The only known polynomial upper bound, however,
applies only in one dimension and only for certain inputs [15].

So what does one do when worst-case analysis is at odds with what is observed in practice? We
turn to the smoothed analysis of Spielman and Teng [24], which considers the running time after
first randomly perturbing the input. Intuitively, this models how fragile worst-case instances are
and whether they could reasonably arise in practice. In addition to the original work on the simplex
algorithm, smoothed analysis has been applied successfully in other contexts, e.g., for the ICP
algorithm [5], online algorithms [8], the knapsack problem [9], and the 2-opt heuristic for the
TSP [13].

The k-means method is in fact a perfect candidate for smoothed analysis: it is extremely
widely used, it runs very fast in practice, and yet the worst-case running time is exponential.
Performing this analysis has proven very challenging, however. It was initiated by
Arthur and Vassilvitskii, who showed that the smoothed running time of the k-means method is
polynomially bounded in $n^k$ and $1/\sigma$, where $\sigma$ is the standard deviation of the Gaussian
perturbations [5]. The term $n^k$ has been improved to $\min(n^{\sqrt{k}}, k^{kd} \cdot n)$ by Manthey and Röglin [21].
Unfortunately, this bound remains exponential even for relatively small values of $k$. In this paper
we settle the smoothed running time of the k-means method: We prove that it is polynomial in
$n$ and $1/\sigma$. The exponents in the polynomial are unfortunately too large to match the practical
observations, but this is in line with other works in smoothed analysis, including Spielman and
Teng's original analysis of the simplex method [24]. The arguments presented here, which reduce
the smoothed upper bound from exponential to polynomial, are intricate enough without trying
to optimize constants, even in the exponent. However, we hope and believe that our work can
be used as a basis for proving tighter results in the future.

1.1 k-Means Method

An input for the k-means method is a set $X \subseteq \mathbb{R}^d$ of $n$ data points. The algorithm outputs
$k$ centers $c_1, \ldots, c_k \in \mathbb{R}^d$ and a partition of $X$ into $k$ clusters $C_1, \ldots, C_k$. The k-means method
proceeds as follows:

1. Select cluster centers $c_1, \ldots, c_k \in \mathbb{R}^d$ arbitrarily.

2. Assign every $x \in X$ to the cluster $C_i$ whose cluster center $c_i$ is closest to it, i.e.,
$\|x - c_i\| \le \|x - c_j\|$ for all $j \ne i$.

3. Set $c_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$.

4. If clusters or centers have changed, goto 2. Otherwise, terminate.

In the following, an iteration of k-means refers to one execution of step 2 followed by step 3.
A slight technical subtlety in the implementation of the algorithm is the possible event that a
cluster loses all its points in step 2. There exist some strategies to deal with this case [15]. For
simplicity, we use the strategy of removing clusters that serve no points and continuing with the
remaining clusters.
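
The four steps above, together with the rule of removing empty clusters, can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the names `k_means`, `center_of_mass`, and `dist2` are hypothetical, points are plain tuples, and ties in step 2 are broken in favor of the lowest cluster index, a choice the description above leaves open.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def center_of_mass(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, centers, max_iter=10_000):
    """Repeat steps 2 and 3 until neither clusters nor centers change.

    Returns (centers, assignment, iterations). Clusters that serve no
    points are removed and the remaining clusters are relabeled.
    """
    assignment = None
    for iteration in range(1, max_iter + 1):
        # Step 2: assign every point to its closest center (ties: lowest index).
        new_assignment = [
            min(range(len(centers)), key=lambda i: dist2(x, centers[i]))
            for x in points
        ]
        # Remove clusters that serve no points and relabel consecutively.
        used = sorted(set(new_assignment))
        relabel = {old: new for new, old in enumerate(used)}
        new_assignment = [relabel[i] for i in new_assignment]
        # Step 3: move each center to the center of mass of its cluster.
        new_centers = [
            center_of_mass([x for x, i in zip(points, new_assignment) if i == j])
            for j in range(len(used))
        ]
        # Step 4: terminate once nothing has changed.
        if new_assignment == assignment and new_centers == centers:
            return centers, assignment, iteration
        centers, assignment = new_centers, new_assignment
    return centers, assignment, max_iter

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
centers, assignment, iterations = k_means(points, [(0.0, 0.0), (1.0, 0.0)])
```

In this sketch the final pass that merely confirms stability is counted as an iteration, so the returned count is one larger than the number of passes that actually change something.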

If we define $c(x)$ to be the center closest to a data point $x$, then one can check that each step
of the algorithm decreases the following potential function:

\[ \Psi = \sum_{x \in X} \|x - c(x)\|^2 . \]

The essential observation for this is the following: If we already have cluster centers $c_1, \ldots, c_k \in \mathbb{R}^d$
representing clusters, then every data point should be assigned to the cluster whose center is
nearest to it to minimize $\Psi$. On the other hand, given clusters $C_1, \ldots, C_k$, the centers $c_1, \ldots, c_k$
should be chosen as the centers of mass of their respective clusters in order to minimize the
potential.

In the following, we will speak of k-means rather than of the k-means method for short. The
worst-case running time of k-means is bounded from above by $(k^2 n)^{kd} \le n^{3kd}$, which follows
from Inaba et al. [16] and Warren [28]. (The bound of $O(n^{kd})$ frequently stated in the literature
holds only for constant values for $k$ and $d$, but in this paper $k$ and $d$ are allowed to grow.)
This upper bound is based solely on the observation that no clustering occurs twice during an
execution of k-means since the potential decreases in every iteration. On the other hand, the
worst-case number of iterations has been proved to be $\exp(\sqrt{n})$ for $d \in \Omega(\sqrt{n})$ [3]. This has
been improved recently to $\exp(n)$ for $d \ge 2$ [25].

1.2 Related Work 

The problem of finding good k-means clusterings allows for polynomial-time approximation
schemes [6, 19, 22] with various dependencies of the running time on $n$, $k$, $d$, and the
approximation ratio $1 + \varepsilon$. The running times of these approximation schemes depend exponentially on
$k$. Recent research on this subject also includes the work by Gaddam et al. [14] and Wagstaff
et al. [27]. However, the most widely used algorithm for k-means clustering is still the k-means
method due to its simplicity and speed.

Despite its simplicity, the k-means method itself and variants thereof are still the subject of
research [4, 17, 23]. Let us mention in particular the work by Har-Peled and Sadri [15] who
have shown that a certain variant of the k-means method runs in polynomial time on certain
instances. In their variant, a data point is said to be $(1 + \varepsilon)$-misclassified if the distance to
its current cluster center is larger by a factor of more than $(1 + \varepsilon)$ than the distance to its
closest center. Their lazy k-means method only reassigns points that are $(1 + \varepsilon)$-misclassified.
In particular, for $\varepsilon = 0$, lazy k-means and k-means coincide. They show that the number of
steps of the lazy k-means method is polynomially bounded in the number of data points, $1/\varepsilon$,
and the spread of the point set (the spread of a point set is the ratio between its diameter and
the distance between its closest pair).

In an attempt to reconcile theory and practice, Arthur and Vassilvitskii [5] performed the
first smoothed analysis of the k-means method: If the data points are perturbed by Gaussian
perturbations of standard deviation $\sigma$, then the smoothed number of iterations is polynomial in
$n^k$, $d$, the diameter of the point set, and $1/\sigma$. However, this bound is still super-polynomial in
the number $n$ of data points. They conjectured that k-means indeed has polynomial smoothed
running time, i.e., that the smoothed number of iterations is bounded by some polynomial in $n$
and $1/\sigma$.

Since then, there has been only partial success in proving the conjecture. Manthey and
Röglin improved the smoothed running time bound by devising two bounds [21]: The first is
polynomial in $n^{\sqrt{k}}$ and $1/\sigma$. The second is $k^{kd} \cdot \mathrm{poly}(n, 1/\sigma)$, where the degree of the polynomial
is independent of $k$ and $d$. Additionally, they proved a polynomial bound for the smoothed
running time of k-means on one-dimensional instances.

1.3 Our Contribution 

We prove that the k-means method has polynomial smoothed running time. This finally proves
Arthur and Vassilvitskii's conjecture [5].

Theorem 1.1. Fix an arbitrary set $X' \subseteq [0, 1]^d$ of $n$ points and assume that each point in
$X'$ is independently perturbed by a normal distribution with mean 0 and standard deviation $\sigma$,
yielding a new set $X$ of points. Then the expected running time of k-means on $X$ is bounded by
a polynomial in $n$ and $1/\sigma$.

We did not optimize the exponents in the polynomial as the arguments presented here, which
reduce the smoothed upper bound from exponential to polynomial, are already intricate enough
and would not yield exponents matching the experimental observations even when optimized. We
hope that similar to the smoothed analysis of the simplex algorithm, where the first polynomial
bound [24] stimulated further research culminating in Vershynin's improved bound [26], our
result here will also be the first step towards a small polynomial bound for the smoothed running
time of k-means. As a reference, let us mention that the upper bound on the expected number
of iterations following from our proof is



The idea is to prove, first, that the potential after one iteration is bounded by some polynomial
and, second, that the potential decreases by some polynomial amount in every iteration (or, more
precisely, in every sequence of a few consecutive iterations). To do this, we prove upper bounds
on the probability that the minimal improvement is small. The main challenge is the huge
number of up to $n^{3kd}$ possible clusterings. Each of these clusterings yields a potential iteration
of k-means, and a simple union bound over all of them is too weak to yield a polynomial bound.

To prove the bound of $\mathrm{poly}(n^{\sqrt{k}}, 1/\sigma)$ [21], a union bound was taken over the $n^{3kd}$ clusterings.
This is already a technical challenge as the set of possible clusterings is fixed only after the
points are fixed. To show a polynomial bound, we reduce the number of cases in the union
bound by introducing the notion of transition blueprints. Basically, every iteration of k-means
can be described by a transition blueprint. The blueprint describes the iteration only roughly, so
that several iterations are described by the same blueprint. Intuitively, iterations with the same 
transition blueprint are correlated in the sense that either all of them make a small improvement 
or none of them do. This dramatically reduces the number of cases that have to be considered 
in the union bound. On the other hand, the description conveyed by a blueprint is still precise 
enough to allow us to bound the probability that any iteration described by it makes a small 
improvement. 

We distinguish between several types of iterations, based on which clusters exchange how
many points. Sections 4.1 to 4.5 deal with some special cases of iterations that need separate
analyses.

After that, we analyze the general case (Section 4.6). The difficulty in this analysis is to
show that every transition blueprint contains "enough randomness". We need to show that this
randomness allows for sufficiently tight upper bounds on the probability that the improvement
obtained from any iteration corresponding to the blueprint is small.

Finally, we put the six sections together to prove that k-means has polynomial smoothed
running time (Section 4.7).

2 Preliminaries 

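For a finite set $X \subseteq \mathbb{R}^d$, let $\mathrm{cm}(X) = \frac{1}{|X|} \sum_{x \in X} x$ be the center of mass of the set $X$.
If $H \subseteq \mathbb{R}^d$ is a hyperplane and $x \in \mathbb{R}^d$ is a single point, then
$\mathrm{dist}(x, H) = \min\{\|x - y\| \mid y \in H\}$ denotes
the distance of the point $x$ to the hyperplane $H$.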

For our smoothed analysis, an adversary specifies an instance $X' \subseteq [0, 1]^d$ of $n$ points. Then
each point $x' \in X'$ is perturbed by adding an independent $d$-dimensional Gaussian random
vector with standard deviation $\sigma$ to $x'$ to obtain the data point $x$. These perturbed points form
the input set $X$. For convenience we assume that $\sigma \le 1$. This assumption is without loss of
generality as for larger values of $\sigma$, the smoothed running time can only be smaller than for
$\sigma = 1$ [21, Section 7]. Additionally we assume $k \le n$ and $d \le n$: First, $k \le n$ is satisfied after
the first iteration since at most $n$ clusters can contain any points. Second, k-means is known to
have polynomial smoothed complexity for $d \in \Omega(n / \log n)$ [3]. The restriction of the adversarial
points to be in $[0, 1]^d$ is necessary as, otherwise, the adversary can diminish the effect of the
perturbation by placing all points far apart from each other. Another way to cope with this
problem is to state the bounds in terms of the diameter of the adversarial instance [5]. However,
to avoid having another parameter, we have chosen the former model.
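
The perturbation model just described can be sketched as follows. This is a minimal illustration, not part of the analysis: the function name `perturb` and the sample instance `X_adv` are hypothetical, and the fixed seed is used only to make the sketch reproducible.

```python
import random

def perturb(adversarial_points, sigma, rng):
    """Add independent Gaussian noise (mean 0, standard deviation sigma)
    to every coordinate of every adversarial point."""
    return [
        tuple(coord + rng.gauss(0.0, sigma) for coord in point)
        for point in adversarial_points
    ]

rng = random.Random(0)                           # seed chosen arbitrarily
X_adv = [(0.0, 0.0), (1.0, 0.0), (0.5, 1.0)]     # adversarial instance in [0, 1]^2
X = perturb(X_adv, sigma=0.1, rng=rng)           # perturbed input set
```

Adding independent one-dimensional Gaussians to each coordinate is the same as adding a $d$-dimensional Gaussian vector with standard deviation $\sigma$, which is why the sketch works coordinate by coordinate.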

Throughout the following, we assume that the perturbed point set $X$ is contained in some
hypercube of side-length $D$, i.e., $X \subseteq [-D/2, D/2]^d = \mathcal{D}$. We choose $D$ such that the probability
of $X \not\subseteq \mathcal{D}$ is bounded from above by $n^{-3kd}$. Then, as the worst-case number of iterations is
bounded by $n^{3kd}$ [16], the event $X \not\subseteq \mathcal{D}$ contributes only an insignificant additive term of $+1$ to
the expected number of iterations, which we ignore in the following.

Since Gaussian random vectors are heavily concentrated around their mean and all means are
in $[0, 1]^d$, we can choose $D = \sqrt{90kd \ln(n)}$ to obtain the desired failure probability for $X \not\subseteq \mathcal{D}$,
as shown by the following calculation, in which $Z$ denotes a one-dimensional Gaussian random
variable with mean 0 and standard deviation 1:

\[
\Pr[X \not\subseteq \mathcal{D}] \le nd \cdot \Pr[|Z| \ge D/2 - 1] \le 2nd \cdot \Pr[Z \ge D/3]
\le \frac{2nd}{\sqrt{2\pi}} \cdot \exp(-D^2/18) \le n^2 \cdot \exp(-D^2/18) \le n^{-3kd} ,
\]



where we used $k \ge 2$, $d \ge 2$, and the tail bound $\Pr[Z \ge z] \le \frac{\exp(-z^2/2)}{z\sqrt{2\pi}}$ for Gaussians [12].
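
The last step of this calculation can be checked numerically. The sketch below assumes that the intermediate bound has the form $n^2 \cdot \exp(-D^2/18)$ and compares it with $n^{-3kd}$ in logarithmic scale, since both quantities underflow quickly; the function names are hypothetical.

```python
import math

def cube_side(n, k, d):
    """Side length D = sqrt(90 * k * d * ln(n)) of the hypercube."""
    return math.sqrt(90 * k * d * math.log(n))

def log_gap(n, k, d):
    """ln(n^{-3kd}) - ln(n^2 * exp(-D^2 / 18)).

    A non-negative value means n^2 * exp(-D^2 / 18) <= n^{-3kd}.
    """
    D = cube_side(n, k, d)
    log_lhs = 2 * math.log(n) - D ** 2 / 18
    log_rhs = -3 * k * d * math.log(n)
    return log_rhs - log_lhs

# Algebraically the gap equals (2kd - 2) * ln(n), which is non-negative.
gaps = [log_gap(n, k, d) for n in (2, 10, 1000) for k in (2, 5) for d in (2, 7)]
```

With this choice of $D$ we have $D^2/18 = 5kd \ln(n)$, so $\exp(-D^2/18) = n^{-5kd}$, which leaves enough room to absorb the factor $n^2$.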

For our smoothed analysis, we use essentially three properties of Gaussian random variables.
Let $X$ be a $d$-dimensional Gaussian random variable with standard deviation $\sigma$. First, the
probability that $X$ assumes a value in any fixed ball of radius $\varepsilon$ is at most $(\varepsilon/\sigma)^d$. Second, let
$b_1, \ldots, b_{d'} \in \mathbb{R}^d$ be orthonormal vectors for some $d' \le d$. Then the vector $(b_1 \cdot X, \ldots, b_{d'} \cdot X) \in \mathbb{R}^{d'}$
is a $d'$-dimensional Gaussian random variable with the same standard deviation $\sigma$. Third, let
$H$ be any hyperplane. Then the probability that a Gaussian random variable assumes a value
that is within a distance of at most $\varepsilon$ from $H$ is bounded by $\varepsilon/\sigma$. This follows also from the
first two properties if we choose $d' = 1$ and $b_1$ to be the normal vector of $H$.

We will often upper-bound various probabilities, and it will be convenient to reduce the 
exponents in these bounds. Under certain conditions, this can be done safely regardless of 
whether the base is smaller or larger than 1. 

Fact 2.1. Let $p$ be a probability, and let $A$, $c$, $b$, $e$, and $e'$ be positive real numbers satisfying $c \ge 1$
and $e \ge e'$. If $p \le A + c \cdot b^e$, then it is also true that $p \le A + c \cdot b^{e'}$.

Proof. If $b$ is at least 1, then $A + c \cdot b^{e'} \ge 1$ and it is trivially true that $p \le A + c \cdot b^{e'}$. Otherwise,
$b^e \le b^{e'}$, and the result follows. □



2.1 Potential Drop in an Iteration of k-Means

During an iteration of the k-means method there are two possible events that can lead to
a significant potential drop: either one cluster center moves significantly, or a data point is 
reassigned from one cluster to another and this point has a significant distance from the bisector 
of the clusters (the bisector is the hyperplane that bisects the two cluster centers). In the 
following we quantify the potential drops caused by these events. 

The potential drop caused by reassigning a data point x from one cluster to another can 
be expressed in terms of the distance of x from the bisector of the two cluster centers and 
the distance of these two centers. The following lemma follows from basic linear algebra (cf., 
e.g., [21, Proof of Lemma 4.5]). 



Lemma 2.2. Assume that, in an iteration of k-means, a point $x \in X$ switches from $C_i$ to $C_j$.
Let $c_i$ and $c_j$ be the centers of these clusters, and let $H$ be their bisector. Then reassigning $x$
decreases the potential by $2 \cdot \|c_i - c_j\| \cdot \mathrm{dist}(x, H)$.

The following lemma, which also follows from basic linear algebra, reveals how moving a 
cluster center to the center of mass decreases the potential. 

Lemma 2.3 (Kanungo et al. [18]). Assume that the center of a cluster $C$ moves from $c$ to $\mathrm{cm}(C)$
during an iteration of k-means, and let $|C|$ denote the number of points in $C$ when the movement
occurs. Then the potential decreases by $|C| \cdot \|c - \mathrm{cm}(C)\|^2$.
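
Both lemmas are easy to confirm on small hand-picked examples. The following sketch does so for one reassignment and one center movement; the helper names and the concrete points are hypothetical and serve only as an illustration.

```python
import math

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def center_of_mass(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def dist_to_bisector(x, ci, cj):
    """Distance from x to the hyperplane bisecting the centers ci and cj."""
    normal = [b - a for a, b in zip(ci, cj)]
    midpoint = [(a + b) / 2 for a, b in zip(ci, cj)]
    offset = sum(n * (xi - m) for n, xi, m in zip(normal, x, midpoint))
    return abs(offset) / math.sqrt(sq_dist(ci, cj))

# Lemma 2.2: x moves from the cluster with center ci to the one with center cj.
ci, cj, x = (0.0, 0.0), (4.0, 0.0), (3.0, 5.0)
drop_reassign = sq_dist(x, ci) - sq_dist(x, cj)
predicted_reassign = 2 * math.sqrt(sq_dist(ci, cj)) * dist_to_bisector(x, ci, cj)

# Lemma 2.3: the center of cluster C moves from c_old to cm(C).
C = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
c_old = (3.0, 1.0)
cm = center_of_mass(C)
drop_move = sum(sq_dist(p, c_old) for p in C) - sum(sq_dist(p, cm) for p in C)
predicted_move = len(C) * sq_dist(c_old, cm)
```

In both cases the directly computed potential drop coincides with the value given by the respective lemma.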

2.2 The Distance between Centers 

As the distance between two cluster centers plays an important role in Lemma 2.2, we analyze
how close together two simultaneous centers can be during the execution of k-means. This has
already been analyzed implicitly [21, Proof of Lemma 3.2], but the variant below gives stronger
bounds. From now on, when we refer to a k-means iteration, we will always mean an iteration
after the first one. By restricting ourselves to this case, we ensure that the centers at the
beginning of the iteration are the centers of mass of actual clusters, as opposed to the arbitrary
choices that were used to seed k-means.

Definition 2.4. Let $\delta_\varepsilon$ denote the minimum distance between two cluster centers at the beginning
of a k-means iteration in which (1) the potential $\Psi$ drops by at most $\varepsilon$, and (2) at least one data
point switches between the clusters corresponding to these centers.

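Lemma 2.5. Fix real numbers $Y \ge 1$ and $e \ge 2$. Then, for any $\varepsilon \in [0, 1]$,

\[ \Pr\left[\delta_\varepsilon \le Y \varepsilon^{1/e}\right] \le \varepsilon \cdot \left( \frac{O(1) \cdot n^5 Y}{\sigma} \right)^{e} . \]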

Proof. Consider a k-means iteration $I$ that results in a potential drop of at most $\varepsilon$, and let $I_0$
denote the previous iteration. Also consider a fixed pair of clusters that exchange at least one
data point during $I$. We define the following:

• Let $a_0$ and $b_0$ denote the centers of the two clusters at the beginning of iteration $I_0$ and
let $H_0$ denote the hyperplane bisecting $a_0$ and $b_0$.

• Let $A$ and $B$ denote the set of data points in the two clusters at the beginning of iteration
$I$. Note that $H_0$ splits $A$ and $B$.

• Let $a$ and $b$ denote the centers of the two clusters at the beginning of iteration $I$, and let
$H$ denote the hyperplane bisecting $a$ and $b$. Note that $a = \mathrm{cm}(A)$ and $b = \mathrm{cm}(B)$.

• Let $A'$ and $B'$ denote the set of data points in the two clusters at the end of iteration $I$.
Note that $H$ splits $A'$ and $B'$.

• Let $a'$ and $b'$ denote the centers of the two clusters at the end of iteration $I$. Note that
$a' = \mathrm{cm}(A')$ and $b' = \mathrm{cm}(B')$.

Also let $t = 3d + \lfloor e \rfloor$. Now suppose we have $\|a - b\| \le Y \varepsilon^{1/e}$.

First we consider the case $|A' \cup A| \ge t + 1$. We claim that every point in $A$ must be within
distance $nY\varepsilon^{1/e}$ of $H_0$. Indeed, if this were not true, then since $H_0$ splits $A$ and $B$, and since
$a = \mathrm{cm}(A)$ and $b = \mathrm{cm}(B)$, we would have $\|a - b\| \ge \mathrm{dist}(a, H_0) > \frac{nY\varepsilon^{1/e}}{|A|} \ge Y\varepsilon^{1/e}$, giving a
contradiction. Furthermore, as $I$ results in a potential drop of at most $\varepsilon$, Lemma 2.3 implies
that $\|a' - a\|, \|b' - b\| \le \sqrt{\varepsilon}$, and therefore,

\[ \|a' - b'\| \le \|a' - a\| + \|a - b\| + \|b - b'\| \le Y\varepsilon^{1/e} + 2\sqrt{\varepsilon} \le 3Y\varepsilon^{1/e} . \]



In particular, we can repeat the above argument to see that every point in $A'$ must be within
distance $3nY\varepsilon^{1/e}$ of $H$. This means that there are two hyperplanes such that every point in
$A \cup A'$ is within distance $3nY\varepsilon^{1/e}$ of at least one of these hyperplanes. Following the arguments
by Arthur and Vassilvitskii [5, Proposition 5.9], we obtain that the probability that there exists
a set $A \cup A'$ of size $t + 1$ with this property is at most

\[
n^{t+1} \cdot \left( \frac{12dnY\varepsilon^{1/e}}{\sigma} \right)^{t+1-2d}
\le \left( \frac{12dn^4Y\varepsilon^{1/e}}{\sigma} \right)^{d + \lfloor e \rfloor + 1} . \tag{1}
\]




This bound can be proven as follows: Arthur and Vassilvitskii [5, Lemma 5.8] have shown that
we can approximate $H$ and $H_0$ by hyperplanes $\tilde{H}$ and $\tilde{H}_0$ that pass through $d$ points from $X$
exactly such that any point $x \in X$ within distance $L$ of $H$ or $H_0$ has a distance of at most
$2dL$ from $\tilde{H}$ or $\tilde{H}_0$, respectively. A union bound over all choices for these $2d$ points and the
remaining $t + 1 - 2d$ points yields the term $n^{t+1}$. Once $\tilde{H}$ and $\tilde{H}_0$ are fixed, the probability that
a random point is within distance $2dL$ of at least one of the hyperplanes is bounded from above
by $4dL/\sigma$. Taking into account that the remaining $t + 1 - 2d$ points are independent Gaussians
yields a final bound of $n^{t+1} (4dL/\sigma)^{t+1-2d}$ with $L = 3nY\varepsilon^{1/e}$.

Note that this quantity bounds the probability that there exists an iteration with $|A' \cup A| \ge
t + 1$ satisfying the conditions in the lemma statement; it does not apply only to the fixed
iteration $I$ that we were considering earlier.

Next we consider the case $|A' \cup A| \le t$. We must have $A' \ne A$ since some point is exchanged
between clusters $A$ and $B$ during iteration $I$. Consider some fixed $A$ and $A'$, and let $x_0$ be a data
point in the symmetric difference of $A$ and $A'$. Then $\mathrm{cm}(A') - \mathrm{cm}(A)$ can be written as $\sum_{x \in X} c_x \cdot x$
for constants $c_x$ with $|c_{x_0}| \ge \frac{1}{n}$. We consider only the randomness in the perturbed position of
$x_0$ and allow all other points in $X$ to be fixed adversarially. Then $\mathrm{cm}(A') - \mathrm{cm}(A)$ follows a
normal distribution with standard deviation at least $\frac{\sigma}{n}$, and hence $\|\mathrm{cm}(A') - \mathrm{cm}(A)\| \le \sqrt{\varepsilon}$ with
probability at most $(n\sqrt{\varepsilon}/\sigma)^d$. On the other hand, Lemma 2.3 implies that $\|\mathrm{cm}(A') - \mathrm{cm}(A)\| \le
\sqrt{\varepsilon}$ must hold for iteration $I$. Otherwise, $I$ would result in a potential drop of more than $\varepsilon$. Now,
the total number of possible sets $A$ and $A'$ is bounded by $(4n)^t$: we choose $t$ candidate points
to be in $A \cup A'$ and then for each point, we choose which set(s) it is in. Taking a union bound
over all possible choices, we see the case $|A' \cup A| \le t$ can occur with a probability of at most

\[
(4n)^t \cdot \left( \frac{n\sqrt{\varepsilon}}{\sigma} \right)^{d}
= \left( \frac{4^{3 + \lfloor e \rfloor / d} \, n^{4 + \lfloor e \rfloor / d} \sqrt{\varepsilon}}{\sigma} \right)^{d} . \tag{2}
\]



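Combining equations (1) and (2), we have

\[
\Pr\left[\delta_\varepsilon \le Y\varepsilon^{1/e}\right]
\le \left( \frac{12dn^4Y\varepsilon^{1/e}}{\sigma} \right)^{d + \lfloor e \rfloor + 1}
+ \left( \frac{4^{3 + \lfloor e \rfloor / d} \, n^{4 + \lfloor e \rfloor / d} \sqrt{\varepsilon}}{\sigma} \right)^{d} .
\]

Note that $d + \lfloor e \rfloor + 1 \ge e$ and $d \ge 2$, so we can reduce exponents according to Fact 2.1:

\[
\begin{aligned}
\Pr\left[\delta_\varepsilon \le Y\varepsilon^{1/e}\right]
&\le \left( \frac{12dn^4Y\varepsilon^{1/e}}{\sigma} \right)^{e}
+ \left( \frac{4^{3 + \lfloor e \rfloor / d} \, n^{4 + \lfloor e \rfloor / d} \sqrt{\varepsilon}}{\sigma} \right)^{2} \\
&\le \varepsilon \cdot \left( \frac{12dn^4Y}{\sigma} \right)^{e}
+ \varepsilon \cdot \left( \frac{4^{3 + e/2} \, n^{4 + e/2}}{\sigma} \right)^{2}
&& \text{since } d \ge 2 \\
&\le \varepsilon \cdot \left( \frac{12n^5Y}{\sigma} \right)^{e}
+ \varepsilon \cdot \left( \frac{4^4 n^5}{\sigma} \right)^{e}
&& \text{since } d \le n,\ e \ge 2 \text{ and } \sigma \le 1.
\end{aligned}
\]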



3 Transition Blueprints 

Our smoothed analysis of k-means is based on the potential function $\Psi$. If $X \subseteq \mathcal{D}$, then after the
first iteration, $\Psi$ will always be bounded from above by a polynomial in $n$ and $1/\sigma$. Therefore,
k-means terminates quickly if we can lower-bound the drop in $\Psi$ during each iteration. So
what must happen for a k-means iteration to result in a small potential drop? Recall that any
iteration consists of two distinct phases: assigning points to centers, and then recomputing center
positions. Furthermore, each phase can only decrease the potential. According to Lemmas 2.2
and 2.3, an iteration can only result in a small potential drop if none of the centers move
significantly and no point is reassigned that has a significant distance to the corresponding
bisector. The previous analyses [5, 21] essentially use a union bound over all possible iterations
to show that it is unlikely that there is an iteration in which none of these events happens. Thus,
with high probability, we get a significant potential drop in every iteration. As the number of
possible iterations can only be bounded by $n^{3kd}$, these union bounds are quite wasteful and yield
only super-polynomial bounds.

We resolve this problem by introducing the notion of transition blueprints. Such a blueprint 
is a description of an iteration of k-means that almost uniquely determines everything that
happens during the iteration. In particular, one blueprint can simultaneously cover many similar 
iterations, which will dramatically reduce the number of cases that have to be considered in the 
union bound. We begin with the notion of a transition graph, which is part of a transition 
blueprint. 

Definition 3.1. Given a k-means iteration, we define its transition graph to be the labeled,
directed multigraph with one vertex for each cluster, and with one edge $(C_i, C_j)$ with label $x$ for
each data point $x$ switching from cluster $C_i$ to cluster $C_j$.

We define a vertex in a transition graph to be balanced if its in-degree is equal to its
out-degree. Similarly, a cluster is balanced during a k-means iteration if the corresponding vertex
in the transition graph is balanced.
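
As an illustration of Definition 3.1, the following sketch builds the edge list of a transition graph from the cluster assignments before and after an iteration and reports the balanced vertices. It is a sketch under assumptions: clusters and data points are represented by integer indices, the function names are hypothetical, and only clusters that are incident to at least one edge are reported (clusters that exchange no points are trivially balanced).

```python
from collections import Counter

def transition_graph(before, after):
    """Edges (i, j, x): data point with index x switches from cluster i to j."""
    return [(i, j, x) for x, (i, j) in enumerate(zip(before, after)) if i != j]

def balanced_clusters(edges):
    """Clusters incident to an edge whose in-degree equals their out-degree."""
    out_deg = Counter(i for i, _, _ in edges)
    in_deg = Counter(j for _, j, _ in edges)
    vertices = set(out_deg) | set(in_deg)
    return {v for v in vertices if in_deg[v] == out_deg[v]}

before = [0, 0, 1, 1, 2]     # cluster of each point at the start of the iteration
after = [1, 0, 0, 2, 2]      # cluster of each point after reassignment
edges = transition_graph(before, after)
balanced = balanced_clusters(edges)
```

Here cluster 0 gains one point and loses one point and is therefore balanced, whereas cluster 1 loses two points and gains one, and cluster 2 only gains a point.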

To make the full blueprint, we also require information on approximate positions of cluster
centers. We will see below that for an unbalanced cluster this information can be deduced from
the data points that change to or from this cluster. For balanced clusters we turn to brute force:
We tile the hypercube $\mathcal{D}$ with a lattice $L_\varepsilon$, where consecutive points are at a distance of
$\sqrt{n\varepsilon/d}$ from each other, and choose one point from $L_\varepsilon$ for every balanced cluster.

Definition 3.2. An $(m, b, \varepsilon)$ transition blueprint $\mathcal{B}$ consists of a weakly connected transition
graph $G$ with $m$ edges and $b$ balanced clusters, and one lattice point in $L_\varepsilon$ for each balanced
cluster in the graph. A k-means iteration is said to follow $\mathcal{B}$ if $G$ is a connected component of
the iteration's transition graph and if the lattice point selected for each balanced cluster is within
a distance of at most $\sqrt{n\varepsilon}$ of the cluster's actual center position.

If $X \subseteq \mathcal{D}$, then by the Pythagorean theorem, every cluster center must be within distance $\sqrt{n\varepsilon}$
of some point in $L_\varepsilon$. Therefore, every k-means iteration follows at least one transition blueprint.
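
The Pythagorean argument can be made concrete with a small sketch. It assumes, purely for illustration, that the lattice contains the origin and extends far enough to cover the center in question; the function name `nearest_lattice_point` is hypothetical.

```python
import math

def nearest_lattice_point(center, n, eps):
    """Round each coordinate to the nearest multiple of sqrt(n * eps / d)."""
    d = len(center)
    spacing = math.sqrt(n * eps / d)
    return tuple(round(c / spacing) * spacing for c in center)

n, eps = 10, 0.01
center = (0.37, -1.2, 2.05)
lattice_point = nearest_lattice_point(center, n, eps)
distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(center, lattice_point)))
```

Each coordinate is off by at most half the lattice spacing, so the Euclidean distance is at most $\frac{1}{2}\sqrt{d} \cdot \sqrt{n\varepsilon/d} = \frac{1}{2}\sqrt{n\varepsilon}$, which is within the required $\sqrt{n\varepsilon}$.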

As $m$ and $b$ grow, the number of valid $(m, b, \varepsilon)$ transition blueprints grows exponentially,
but the probability of failure that we will prove in the following section decreases equally fast, 
making the union bound possible. This is what we gain by studying transition blueprints rather 
than every possible configuration separately. 

For an unbalanced cluster $C$ that gains the points $A \subseteq X$ and loses the points $B \subseteq X$ during
the considered iteration, the approximate center of $C$ is defined as

\[ \frac{|B| \, \mathrm{cm}(B) - |A| \, \mathrm{cm}(A)}{|B| - |A|} . \]

If $C$ is balanced, then the approximate center of $C$ is the lattice point specified in the transition
blueprint. The approximate bisector of $C_i$ and $C_j$ is the bisector of the approximate centers of
$C_i$ and $C_j$. Now consider a data point $x$ switching from some cluster $C_i$ to some other cluster $C_j$.
We say the approximate bisector corresponding to $x$ is the hyperplane bisecting the approximate
centers of $C_i$ and $C_j$. Unfortunately, this definition applies only if $C_i$ and $C_j$ have distinct
approximate centers, which is not necessarily the case (even after the random perturbation).
We will call a blueprint non-degenerate if the approximate bisector is in fact well defined for
each data point that switches clusters. The intuition is that, if one actual cluster center is
far away from its corresponding approximate center, then during the considered iteration the
cluster center must move significantly, which causes a potential drop according to Lemma 2.3.
Otherwise, the approximate bisectors are close to the actual bisectors and we can show that it is
unlikely that all points that change their assignment are close to their corresponding approximate
bisectors. This will yield a potential drop according to Lemma 2.2.

The following lemma formalizes what we mentioned above: If the center of an unbalanced
cluster is far away from its approximate center, then this causes a potential drop in the
corresponding iteration.

Lemma 3.3. Consider an iteration of k-means in which a cluster $C$ gains a set $A$ of points and
loses a set $B$ of points with $|A| \ne |B|$. If
$\left\| \mathrm{cm}(C) - \frac{|B| \, \mathrm{cm}(B) - |A| \, \mathrm{cm}(A)}{|B| - |A|} \right\| \ge \sqrt{n\varepsilon}$,
then the potential decreases by at least $\varepsilon$.

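Proof. Let $C' = (C \setminus B) \cup A$ denote the cluster after the iteration. According to Lemma 2.3, the
potential drops in the considered iteration by at least

\[
\begin{aligned}
|C'| \cdot \|\mathrm{cm}(C') - \mathrm{cm}(C)\|^2
&= (|C| + |A| - |B|) \cdot \left\| \frac{|C| \, \mathrm{cm}(C) + |A| \, \mathrm{cm}(A) - |B| \, \mathrm{cm}(B)}{|C| + |A| - |B|} - \mathrm{cm}(C) \right\|^2 \\
&= \frac{(|B| - |A|)^2}{|C| + |A| - |B|} \cdot \left\| \frac{|B| \, \mathrm{cm}(B) - |A| \, \mathrm{cm}(A)}{|B| - |A|} - \mathrm{cm}(C) \right\|^2 \\
&\ge \frac{(\sqrt{n\varepsilon})^2}{n} = \varepsilon .
\end{aligned}
\]

The last inequality uses $(|B| - |A|)^2 \ge 1$ and $|C| + |A| - |B| \le n$. □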
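
The identity used in this proof can be verified numerically. The sketch below computes the approximate center of an unbalanced cluster from the gained set $A$ and the lost set $B$ and compares both sides of the identity on a small hypothetical example; all function names and points are chosen for illustration only.

```python
def vector_sum(points, dim):
    """Coordinate-wise sum; the empty set sums to the zero vector."""
    return [sum(p[i] for p in points) for i in range(dim)]

def center_of_mass(points, dim):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(s / len(points) for s in vector_sum(points, dim))

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def approximate_center(gained, lost, dim):
    """(|B| cm(B) - |A| cm(A)) / (|B| - |A|) for gained set A and lost set B."""
    diff = len(lost) - len(gained)
    if diff == 0:
        raise ValueError("only defined for unbalanced clusters")
    s_gained, s_lost = vector_sum(gained, dim), vector_sum(lost, dim)
    return tuple((b - a) / diff for a, b in zip(s_gained, s_lost))

C = [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0)]   # cluster before the iteration
A = [(1.0, 2.0), (3.0, 2.0)]               # points gained
B = [(4.0, 0.0)]                           # points lost
C_new = [p for p in C if p not in B] + A   # cluster after the iteration

cm_old = center_of_mass(C, 2)
lhs = len(C_new) * sq_dist(center_of_mass(C_new, 2), cm_old)
rhs = (len(B) - len(A)) ** 2 / len(C_new) * sq_dist(approximate_center(A, B, 2), cm_old)
```

Writing $|B| \, \mathrm{cm}(B)$ as the plain sum of the points in $B$ keeps the formula well defined even when one of the two sets is empty.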



Now we show that we get a significant potential drop if a point that changes its assignment
is far from its corresponding approximate bisector. Formally, we will be studying the following
quantity $\Lambda(\mathcal{B})$.

Definition 3.4. Fix a non-degenerate $(m, b, \varepsilon)$-transition blueprint $\mathcal{B}$. Let $\Lambda(\mathcal{B})$ denote the
maximum distance between a data point in the transition graph of $\mathcal{B}$ and its corresponding
approximate bisector.

Lemma 3.5. Fix $\varepsilon \in [0, 1]$ and a non-degenerate $(m, b, \varepsilon)$-transition blueprint $\mathcal{B}$. If there exists
an iteration that follows $\mathcal{B}$ and that results in a potential drop of at most $\varepsilon$, then

\[ \delta_\varepsilon \cdot \Lambda(\mathcal{B}) \le 6D\sqrt{nd\varepsilon} . \]

Proof. Fix an iteration that follows $\mathcal{B}$ and that results in a potential drop of at most $\varepsilon$. Consider
a data point $x$ that switches between clusters $C_i$ and $C_j$ during this iteration. Let $p$ and $q$ denote
the center positions of these two clusters at the beginning of the iteration, and let $p'$ and $q'$ denote
the approximate center positions of the clusters. Also let $H$ denote the hyperplane bisecting $p$
and $q$, and let $H'$ denote the hyperplane bisecting $p'$ and $q'$.

We begin by bounding the divergence between the hyperplanes H and H' . 



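Claim 3.6. Let $u$ and $v$ be arbitrary points on $H$. Then,
$\mathrm{dist}(v, H') - \mathrm{dist}(u, H') \le \frac{4\sqrt{n\varepsilon}}{\delta_\varepsilon} \cdot \|v - u\|$.

Proof. Let $\theta$ denote the angle between the normal vectors of the hyperplanes $H$ and $H'$. We
move the vector $\overrightarrow{p'q'}$ to become $\overrightarrow{pq''}$ for some point $q''$, which ensures $\angle qpq'' = \theta$. Note that
$\|q'' - q\| \le \|q'' - q'\| + \|q' - q\| = \|p - p'\| + \|q' - q\| \le 2\sqrt{n\varepsilon}$ by Lemma 3.3.

Let $r$ be the point where the bisector of the angle $\angle qpq''$ hits the segment $qq''$. By the sine
law, we have

\[
\sin\left(\frac{\theta}{2}\right) = \sin(\angle prq) \cdot \frac{\|q - r\|}{\|p - q\|}
\le \frac{\|q'' - q\|}{\delta_\varepsilon} \le \frac{2\sqrt{n\varepsilon}}{\delta_\varepsilon} .
\]

Let $y$ and $y'$ be unit vectors in the direction $\overrightarrow{pq}$ and $\overrightarrow{p'q'}$, respectively, and let $z$ be an arbitrary
point on $H'$. Then,

\[
\begin{aligned}
\mathrm{dist}(v, H') - \mathrm{dist}(u, H') &= |(v - z) \cdot y'| - |(u - z) \cdot y'| \\
&\le |(v - u) \cdot y'| && \text{by the triangle inequality} \\
&= |(v - u) \cdot y + (v - u) \cdot (y' - y)| \\
&= |(v - u) \cdot (y' - y)| && \text{since } u, v \in H \\
&\le \|v - u\| \cdot \|y' - y\| .
\end{aligned}
\]

Now we consider the isosceles triangle formed by the normal vectors $y$ and $y'$. The angle between
$y$ and $y'$ is $\theta$. Using the sine law again, we get

\[ \|y' - y\| = 2 \cdot \sin\left(\frac{\theta}{2}\right) \le \frac{4\sqrt{n\varepsilon}}{\delta_\varepsilon} , \]

and the claim follows. □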

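We now continue the proof of Lemma 3.5. Let $h$ denote the foot of the perpendicular from $x$
to $H$, and let $m = \frac{p + q}{2}$. Then,

\[
\begin{aligned}
\mathrm{dist}(x, H') &\le \|x - h\| + \mathrm{dist}(h, H') \\
&= \mathrm{dist}(x, H) + \mathrm{dist}(m, H') + \mathrm{dist}(h, H') - \mathrm{dist}(m, H') \\
&\le \mathrm{dist}(x, H) + \mathrm{dist}(m, H') + \frac{4\sqrt{n\varepsilon}}{\delta_\varepsilon} \cdot \|h - m\| ,
\end{aligned} \tag{3}
\]

where the last inequality follows from Claim 3.6. By Lemma 2.2, we know that the total potential
drop during the iteration is at least $2 \cdot \|p - q\| \cdot \mathrm{dist}(x, H)$. However, we assumed that this drop
was at most $\varepsilon$, so we therefore have $\mathrm{dist}(x, H) \le \frac{\varepsilon}{2\delta_\varepsilon}$. Also, by Lemma 3.3,

\[
\mathrm{dist}(m, H') \le \left\| \frac{p' + q'}{2} - \frac{p + q}{2} \right\|
\le \frac{1}{2} \cdot \|p' - p\| + \frac{1}{2} \cdot \|q' - q\| \le \sqrt{n\varepsilon} .
\]

Furthermore, $\|h - m\| \le \|m - x\| \le D\sqrt{d}$ since $h - m$ is perpendicular to $x - h$ and both $m$ and $x$ lie
in the hypercube $[-D/2, D/2]^d$. Plugging these bounds into equation (3), we have

\[
\begin{aligned}
\mathrm{dist}(x, H') &\le \frac{\varepsilon}{2\delta_\varepsilon} + \sqrt{n\varepsilon} + \frac{4D\sqrt{nd\varepsilon}}{\delta_\varepsilon} \\
&\le \sqrt{n\varepsilon} + \frac{5D\sqrt{nd\varepsilon}}{\delta_\varepsilon} && \text{since } \varepsilon \le 1 \\
&\le \frac{6D\sqrt{nd\varepsilon}}{\delta_\varepsilon} && \text{since } \delta_\varepsilon \le \|p - q\| \le D\sqrt{d} .
\end{aligned}
\]

This bound holds for all data points $x$ that switch clusters, so the lemma follows. □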

4 Analysis of Transition Blueprints 

Let $\Delta$ denote the smallest improvement of the potential $\Psi$ made by any sequence of three
consecutive iterations of the $k$-means method. In the following, we will define and analyze some
variables $\Delta_i$ such that $\Delta$ can be bounded from below by the minimum of the $\Delta_i$. These random
variables are essentially a case analysis covering different types of transition graphs. The first
five cases deal with special types of blueprints that require separate attention and do not fit into
the general framework of case six. The sixth and most involved case (Section 4.6) deals with
general blueprints.

When analyzing these random variables, we will ignore the case that a cluster can lose all its
points in one iteration. If this happens, then $k$-means continues with one cluster less, which can
happen only $k$ times. Since the potential $\Psi$ does not increase even in this case, this gives only
an additive term of $k$ to our analysis.

In the lemmas in this section, we do not specify the parameters $m$ and $b$ when talking about
transition blueprints. When we say an iteration follows a blueprint with some property $P$, we
mean that there are parameters $m$ and $b$ such that the iteration follows an $(m, b, \varepsilon)$ transition
blueprint with property $P$, where $\varepsilon$ will be clear from the context.



4.1 Balanced Clusters of Small Degree 

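Lemma 4.1. Fix $\varepsilon \ge 0$ and a constant $z_1 \in \mathbb{N}$. Let $\Delta_1$ denote the smallest improvement made
by any iteration that follows a blueprint with a balanced non-isolated node of in- and outdegree
at most $z_1 d$. Then,
\[ \Pr[\Delta_1 \le \varepsilon] \le \varepsilon \cdot \left( \frac{n^{4z_1 + 1}}{\sigma^2} \right) . \]
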
Proof. We denote the balanced cluster of in- and outdegree at most $z_1 d$ by $C$. If the center of
$C$ moves by $\delta$, then the potential drops by at least $|C|\delta^2$. Hence, $\Delta_1$ can only be smaller than $\varepsilon$
if the center of $C$ moves by at most $\sqrt{\varepsilon/|C|}$ during the considered iteration. Let $A$ and $B$ with
$|A| = |B| \le z_1 d$ be the sets of data points corresponding to the incoming and outgoing edges
of $C$, respectively. If $|A| \operatorname{cm}(A)$ and $|B| \operatorname{cm}(B)$ differ by at least $\sqrt{n\varepsilon} \ge \sqrt{|C|\varepsilon}$, then the cluster
center moves by at least $\sqrt{\varepsilon/|C|}$ as shown by the following reasoning: Let $c$ be the center of
mass of the points that belong to $C$ at the beginning of the iteration and remain in $C$ during
the iteration. Then the center of mass of $C$ moves from
\[ \frac{(|C| - |A|)\, c + |B| \operatorname{cm}(B)}{|C|} \quad \text{to} \quad \frac{(|C| - |A|)\, c + |A| \operatorname{cm}(A)}{|C|} . \]
Since $|A| = |B|$, these two locations differ by
\[ \left\| \frac{|B| \operatorname{cm}(B) - |A| \operatorname{cm}(A)}{|C|} \right\| . \]

By Lemma 2.3, this causes a potential drop of at least $|C| \left( \sqrt{\varepsilon/|C|} \right)^2 = \varepsilon$. The random variable
$|A| \operatorname{cm}(A)$ is a Gaussian random variable with a standard deviation of $\sqrt{|A|}\,\sigma \ge \sigma$. If the points
of $B$ are fixed arbitrarily, then $|A| \operatorname{cm}(A)$ has to assume a position within distance $\sqrt{n\varepsilon}$ of
$|B| \operatorname{cm}(B)$ for the iteration to make an improvement of at most $\varepsilon$.
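The shift of the center of mass described above can be verified on a toy instance with exact arithmetic. This is only an illustrative sketch: the helper names `center` and `center_shift` and the sample points are assumptions introduced here.

```python
from fractions import Fraction

def center(points):
    """Exact center of mass of a non-empty list of integer points."""
    dim = len(points[0])
    return tuple(Fraction(sum(p[i] for p in points), len(points)) for i in range(dim))

def center_shift(stay, leaving, entering):
    """Shift of the cluster center when `leaving` is swapped for `entering`."""
    before = center(stay + leaving)
    after = center(stay + entering)
    return tuple(a - b for a, b in zip(after, before))

stay = [(0, 0), (2, 1), (4, 5)]
leaving = [(1, 1), (3, 0)]    # the set B (outgoing edges)
entering = [(6, 2), (0, 7)]   # the set A (incoming edges), |A| = |B|
size = len(stay) + len(leaving)

# Expected shift: (|A| cm(A) - |B| cm(B)) / |C|, i.e. difference of coordinate sums over |C|.
expected = tuple(
    Fraction(sum(p[i] for p in entering) - sum(p[i] for p in leaving), size)
    for i in range(2)
)
assert center_shift(stay, leaving, entering) == expected
```

Because $|A| = |B|$, the cluster size is unchanged and the contribution of the points that stay cancels, which is why only the two coordinate sums appear.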

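Now we apply a union bound over all possible choices of $A$ and $B$. We can assume that both
$A$ and $B$ contain exactly $z_1 d$ points. Otherwise, we can pad them by adding the same points to
both of them, which does not affect the analysis. Hence, the number of choices is bounded by
$n^{2z_1 d}$ and we get
\begin{align*}
\Pr[\Delta_1 \le \varepsilon] &\le \Pr\left[ \exists A, B, |A| = |B| = z_1 d \colon \big\| |A| \operatorname{cm}(A) - |B| \operatorname{cm}(B) \big\| \le \sqrt{n\varepsilon} \right] \\
&\le n^{2z_1 d} \cdot \left( \frac{\sqrt{n\varepsilon}}{\sigma} \right)^d \le \left( \frac{n^{2z_1 + 1/2} \sqrt{\varepsilon}}{\sigma} \right)^d .
\end{align*}
Using Fact 2.1 and $d \ge 2$ concludes the proof. □

4.2 Nodes of Degree One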

Lemma 4.2. Fix $\varepsilon \in [0, 1]$. Let $\Delta_2$ denote the smallest improvement made by any iteration that
follows a blueprint with a node of degree 1. Then,



n 



Proof. Assume that a point $x$ switches from cluster $C_1$ to cluster $C_2$, and let $c_1$ and $c_2$ denote the
positions of the cluster centers at the beginning of the iteration. Let $\nu$ be the distance between
$c_1$ and $c_2$. Then $c_2$ has a distance of $\nu/2$ from the bisector of $c_1$ and $c_2$, and the point $x$ is on
the same side of the bisector as $c_2$.

If $C_1$ has only one edge, then the center of cluster $C_1$ moves during this iteration by at
least $\frac{\nu}{2(|C_1| - 1)}$, where $|C_1|$ denotes the number of points belonging to $C_1$ at the beginning of the
iteration: the point $x$ has a distance of at least $\nu/2$ from $c_1$, which yields a movement of
\[ \left\| c_1 - \frac{c_1 |C_1| - x}{|C_1| - 1} \right\| = \left\| \frac{c_1 - x}{|C_1| - 1} \right\| \ge \frac{\nu}{2(|C_1| - 1)} . \]
Hence, the potential drops by at least
\[ (|C_1| - 1) \left( \frac{\nu}{2|C_1| - 2} \right)^2 \ge \frac{\nu^2}{4|C_1|} \ge \frac{\nu^2}{4n} . \]

If $C_2$ has only one edge, then let $\alpha$ be the distance of the point $x$ to the bisector of $c_1$ and $c_2$.
By reassigning the point, we get a potential drop of $2\alpha\nu$. Additionally, $\|x - c_2\| \ge |\nu/2 - \alpha|$.
Thus, $C_2$ moves by at least
\[ \left\| c_2 - \frac{c_2 |C_2| + x}{|C_2| + 1} \right\| = \left\| \frac{c_2 - x}{|C_2| + 1} \right\| \ge \frac{|\nu/2 - \alpha|}{|C_2| + 1} . \]
This causes a potential drop of at least
\[ (|C_2| + 1) \cdot \frac{(\nu/2 - \alpha)^2}{(|C_2| + 1)^2} = \frac{(\nu/2 - \alpha)^2}{|C_2| + 1} \ge \frac{(\nu/2 - \alpha)^2}{n} . \]
Hence, the potential drops by at least
\[ 2\alpha\nu + \frac{(\nu/2 - \alpha)^2}{n} \ge \frac{(\nu/2 + \alpha)^2}{n} \ge \frac{\nu^2}{4n} . \]

We can assume $\nu \ge \delta_\varepsilon$ since $\delta_\varepsilon$ denotes the closest distance between any two simultaneous
centers in iterations leading to a potential drop of at most $\varepsilon$. To conclude the proof, we combine
the two cases: If $C_1$ has only one edge, the potential drop can only be bounded from above by $\varepsilon$
if $\varepsilon \ge \frac{\nu^2}{4n} \ge \frac{\delta_\varepsilon^2}{4n}$. Similarly, if $C_2$ has only one edge, the potential drop can only be bounded from
above by $\varepsilon$ if $\varepsilon \ge \frac{\nu^2}{4n} \ge \frac{\delta_\varepsilon^2}{4n}$. Hence, Lemma 2.5 yields
\[ \Pr[\Delta_2 \le \varepsilon] \le \Pr\left[ \frac{\delta_\varepsilon^2}{4n} \le \varepsilon \right] = \Pr\left[ \delta_\varepsilon \le 2\sqrt{n\varepsilon} \right] \]
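The chain of inequalities for the case that $C_2$ has only one edge can be spot-checked with exact rational arithmetic. The sketch below is an illustration under the stated assumptions $\nu > 0$, $\alpha \ge 0$, and $n \ge 1$; the function name `drop_chain` and the random sampling are hypothetical and not taken from the paper.

```python
import random
from fractions import Fraction

def drop_chain(nu, alpha, n):
    """Return the three quantities 2*alpha*nu + (nu/2 - alpha)^2/n, (nu/2 + alpha)^2/n, nu^2/(4n)."""
    first = 2 * alpha * nu + (nu / 2 - alpha) ** 2 / n
    second = (nu / 2 + alpha) ** 2 / n
    third = nu ** 2 / (4 * n)
    return first, second, third

rng = random.Random(0)
for _ in range(1000):
    nu = Fraction(rng.randint(1, 100), rng.randint(1, 100))
    alpha = Fraction(rng.randint(0, 100), rng.randint(1, 100))
    n = rng.randint(1, 50)
    first, second, third = drop_chain(nu, alpha, n)
    assert first >= second >= third
```

The first inequality holds because the two sides differ by $2\alpha\nu(1 - 1/n) \ge 0$, and the second because $\alpha \ge 0$.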

4.3 Pairs of Adjacent Nodes of Degree Two

Given a transition blueprint, we now look at pairs of adjacent nodes of degree 2. Since we have
already dealt with the case of balanced clusters of small degree (Section 4.1), we can assume
that the nodes involved are unbalanced. This means that one cluster of the pair gains two points
while the other cluster of the pair loses two points.

Lemma 4.3. Fix $\varepsilon \in [0, 1]$. Let $\Delta_3$ denote the smallest improvement made by any iteration that
follows a non-degenerate blueprint with at least three disjoint pairs of adjacent unbalanced nodes
of degree 2. Then,

30^ 



Proof. Fix a transition blueprint $\mathcal{B}$ containing at least 3 disjoint pairs of adjacent unbalanced
degree-two nodes. We first bound $\Pr[\Lambda(\mathcal{B}) \le \lambda]$. For $i = 1, 2, 3$, let $a_i$, $b_i$, and $c_i$ denote the
data points corresponding to the edges in the $i$-th pair of adjacent degree-two nodes, and assume
without loss of generality that $b_i$ corresponds to the inner edge (the edge that connects the pair
of degree-two nodes).

Let $C_i$ and $C_i'$ be the clusters corresponding to one such pair of nodes. Since $C_i$ and $C_i'$ are
unbalanced, we can further assume without loss of generality that $C_i$ loses both data points $a_i$
and $b_i$ during the iteration, and $C_i'$ gains both data points $b_i$ and $c_i$.

Now, $C_i$ has its approximate center at $p_i = \frac{a_i + b_i}{2}$ and $C_i'$ has its approximate center at $q_i = \frac{b_i + c_i}{2}$.
Since $\mathcal{B}$ is non-degenerate, we know $p_i \ne q_i$ and hence $a_i \ne c_i$. Let $H_i$ denote the
hyperplane bisecting $a_i$ and $c_i$, and let $H_i'$ denote the hyperplane bisecting $p_i$ and $q_i$. Since $H_i$
is the image of $H_i'$ under a dilation with center $b_i$ and scale 2, we have
\[ \Lambda(\mathcal{B}) \ge \max_i \left( \operatorname{dist}(b_i, H_i') \right) = \frac{\max_i \left( \operatorname{dist}(b_i, H_i) \right)}{2} . \tag{4} \]
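The dilation argument behind equation (4) says that $b_i$ is exactly twice as far from the bisector of $a_i$ and $c_i$ as from the bisector of $p_i$ and $q_i$. The following Python sketch checks this on one planar example; the helper `dist_to_bisector` and the sample coordinates are assumptions made for illustration.

```python
import math

def dist_to_bisector(point, u, w):
    """Distance from `point` to the hyperplane bisecting the points u and w."""
    normal = [wi - ui for ui, wi in zip(u, w)]
    midpoint = [(ui + wi) / 2 for ui, wi in zip(u, w)]
    offset = sum((xi - mi) * ni for xi, mi, ni in zip(point, midpoint, normal))
    return abs(offset) / math.hypot(*normal)

a, b, c = (0.0, 0.0), (1.0, 3.0), (4.0, 1.0)
p = tuple((ai + bi) / 2 for ai, bi in zip(a, b))   # approximate center (a + b) / 2
q = tuple((bi + ci) / 2 for bi, ci in zip(b, c))   # approximate center (b + c) / 2

d_h = dist_to_bisector(b, a, c)        # dist(b, H)
d_h_prime = dist_to_bisector(b, p, q)  # dist(b, H')
assert math.isclose(d_h, 2 * d_h_prime)
```

The factor 2 appears because the dilation with center $b$ and scale 2 maps $p$ to $a$ and $q$ to $c$, and it scales every distance from $b$ by the same factor.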

All three pairs of adjacent degree-two nodes are disjoint, so we know $b_i$ is distinct from $b_j$ for
$j \ne i$ and distinct from $a_j$ and $c_j$ for all $j$. This implies the position of $b_i$ is independent of $b_j$ for
$j \ne i$, and it is also independent of the position and orientation of $H_j$ for all $j$. In particular, the
quantities $\operatorname{dist}(b_i, H_i)$ follow independent one-dimensional normal distributions with standard
deviation $\sigma$. Therefore, for any $\lambda \ge 0$, we have
\[ \Pr[\Lambda(\mathcal{B}) \le \lambda] \le \Pr\left[ \max_i \left( \operatorname{dist}(b_i, H_i) \right) \le 2\lambda \right] \le \left( \frac{2\lambda}{\sigma} \right)^3 . \]

Let $\mathbb{B}$ denote the set of non-degenerate transition blueprints containing at least three disjoint
pairs of unbalanced degree-two nodes. The preceding analysis of $\Pr[\Lambda(\mathcal{B}) \le \lambda]$ depends only on
$\{a_i, b_i, c_i\}$, so we can use a union bound over all choices of $\{a_i, b_i, c_i\}$ as follows:
\[ \Pr\left[ \exists \mathcal{B} \in \mathbb{B} \colon \Lambda(\mathcal{B}) \le \lambda \right] \le n^9 \cdot \left( \frac{2\lambda}{\sigma} \right)^3 = \left( \frac{2n^3\lambda}{\sigma} \right)^3 . \tag{5} \]

Now, Lemma 3.5 yields that if an iteration can follow a blueprint $\mathcal{B}$ and result in a potential
drop of at most $\varepsilon$, then $\delta_\varepsilon \cdot \Lambda(\mathcal{B}) \le 6D\sqrt{nd\varepsilon}$. We must therefore have either $\delta_\varepsilon \le \varepsilon^{1/6}$ or
$\Lambda(\mathcal{B}) \le 6D\sqrt{nd} \cdot \varepsilon^{1/3}$. We bound the probability that this can happen using Lemma 2.5 and
equation (5):



Pr[A3<e] < Pr 
< e • 



6s < e^" 
0(1) -n 



+ Pr 



3B G 



a 



0{l)-n 



30 



smce D = Y^md •ln(n), a < 1, and d,k < n. 



(l2Dn^y/ml 



3 



□ 



4.4 Blueprints with Constant Degree 

Now we analyze iterations that follow blueprints in which every node has constant degree. It 
might happen that a single iteration does not yield a significant improvement in this case. But 
we get a significant improvement after three consecutive iterations of this kind. The reason 
for this is that during three iterations one cluster must assume three different configurations. 
One case in the previous analyses [5,21] is iterations in which every cluster exchanges at most
$O(dk)$ data points with other clusters. The case considered in this section is similar, but instead
of relying on the somewhat cumbersome notion of key-values used in the previous analyses, we
present a simplified and more intuitive analysis here, which also sheds more light on the previous
analyses.

We define an epoch to be a sequence of consecutive iterations in which no cluster center
assumes more than two different positions. Equivalently, there are at most two different sets
$C_i', C_i''$ that every cluster $C_i$ assumes. Arthur and Vassilvitskii [5] used the obvious upper bound
of $2^k$ for the length of an epoch (the term length refers to the number of iterations in the
sequence). This upper bound has been improved to two. By the definition of length of an
epoch, this means that after at most three iterations, either $k$-means terminates or one cluster
assumes a third configuration.

Lemma 4.4 (Manthey, Röglin [21, Lemma 4.1]). The length of any epoch is at most two.

For our analysis, we introduce the notion of $(\eta, c)$-coarseness. In the following, $\triangle$ denotes the
symmetric difference of two sets.

Definition 4.5. We say that $X$ is $(\eta, c)$-coarse if for any pairwise distinct subsets $C_1$, $C_2$, and $C_3$
of $X$ with $|C_1 \triangle C_2| \le c$ and $|C_2 \triangle C_3| \le c$, either $\|\operatorname{cm}(C_1) - \operatorname{cm}(C_2)\| > \eta$ or $\|\operatorname{cm}(C_2) - \operatorname{cm}(C_3)\| > \eta$.
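To make Definition 4.5 concrete, the following brute-force Python sketch tests $(\eta, c)$-coarseness for a tiny one-dimensional point set. It is exponential in the number of points and only meant as an illustration; the function name `is_coarse`, the restriction to non-empty subsets, and the sample data are assumptions introduced here.

```python
from fractions import Fraction
from itertools import combinations, permutations

def is_coarse(points, eta, c):
    """Brute-force check of (eta, c)-coarseness for a small list of 1-D integer points."""
    indices = range(len(points))
    subsets = [frozenset(s) for size in range(1, len(points) + 1)
               for s in combinations(indices, size)]
    cm = {s: Fraction(sum(points[i] for i in s), len(s)) for s in subsets}
    # permutations(...) yields ordered triples of pairwise distinct subsets.
    for c1, c2, c3 in permutations(subsets, 3):
        if len(c1 ^ c2) > c or len(c2 ^ c3) > c:
            continue
        # The definition is violated if both centers of mass moved by at most eta.
        if abs(cm[c1] - cm[c2]) <= eta and abs(cm[c2] - cm[c3]) <= eta:
            return False
    return True

sample = [0, 1, 10, 100]
print(is_coarse(sample, 0, 1), is_coarse(sample, 1000, 1))
```

For large $\eta$ every admissible triple violates the condition, while for $\eta = 0$ a violation would require a point that coincides with the center of mass of a subset it is added to.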



According to Lemma 4.4, in every sequence of three consecutive iterations, one cluster assumes
three different configurations. This yields the following lemma.

Lemma 4.6. Assume that $X$ is $(\eta, c)$-coarse and consider a sequence of three consecutive iterations.
If in each of these iterations every cluster exchanges at most $c$ points, then the potential
decreases by at least $\eta^2$.

Proof. According to Lemma 4.4, there is one cluster that assumes three different configurations
$C_1$, $C_2$, and $C_3$ in this sequence. Due to the assumption in the lemma, we have $|C_1 \triangle C_2| \le c$ and
$|C_2 \triangle C_3| \le c$. Hence, due to the definition of $(\eta, c)$-coarseness, we have $\|\operatorname{cm}(C_i) - \operatorname{cm}(C_{i+1})\| > \eta$
for one $i \in \{1, 2\}$. Combining this with Lemma 2.3 concludes the proof. □

Lemma 4.7. For $\eta \ge 0$, the probability that $X$ is not $(\eta, c)$-coarse is at most $(7n)^{2c} \cdot (2nc\eta/\sigma)^d$.

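Proof. Given any sets $C_1$, $C_2$, and $C_3$ with $|C_1 \triangle C_2| \le c$ and $|C_2 \triangle C_3| \le c$, we can write $C_i$, for
$i \in \{1, 2, 3\}$, uniquely as the disjoint union of a common ground set $A \subseteq X$ with a set $B_i \subseteq X$
with $B_1 \cap B_2 \cap B_3 = \emptyset$. Furthermore,
\[ B_1 \cup B_2 \cup B_3 = (C_1 \cup C_2 \cup C_3) \setminus A = (C_1 \triangle C_2) \cup (C_2 \triangle C_3) , \]
so $|B_1 \cup B_2 \cup B_3| = |(C_1 \triangle C_2) \cup (C_2 \triangle C_3)| \le 2c$.

We perform a union bound over all choices for the sets $B_1$, $B_2$, and $B_3$. The number of choices
for these sets is bounded from above by $7^{2c} \cdot \binom{n}{2c} \le (7n)^{2c}$: we choose $2c$ candidate points to be in
$B_1 \cup B_2 \cup B_3$ and then for each point, we choose which set(s) it is in. We assume in the following
that the sets $B_1$, $B_2$, and $B_3$ are fixed. For $i \in \{1, 2\}$, we can write $\operatorname{cm}(C_i) - \operatorname{cm}(C_{i+1})$ as
\[ \left( \frac{|A|}{|A| + |B_i|} - \frac{|A|}{|A| + |B_{i+1}|} \right) \cdot \operatorname{cm}(A) + \frac{|B_i|}{|A| + |B_i|} \cdot \operatorname{cm}(B_i) - \frac{|B_{i+1}|}{|A| + |B_{i+1}|} \cdot \operatorname{cm}(B_{i+1}) . \tag{6} \]

Let us first consider the case that we have $|B_i| = |B_{i+1}|$ for one $i \in \{1, 2\}$. Then $\operatorname{cm}(C_i) - \operatorname{cm}(C_{i+1})$ simplifies to
\[ \frac{|B_i|}{|A| + |B_i|} \cdot \left( \operatorname{cm}(B_i) - \operatorname{cm}(B_{i+1}) \right) = \frac{1}{|A| + |B_i|} \cdot \left( \sum_{x \in B_i} x - \sum_{x \in B_{i+1}} x \right) . \]
Since $B_i \ne B_{i+1}$, there exists a point $x \in B_i \triangle B_{i+1}$. Let us assume without loss of generality
that $x \in B_i \setminus B_{i+1}$ and that the positions of all points
in $(B_i \cup B_{i+1}) \setminus \{x\}$ are fixed arbitrarily. Then the event that $\|\operatorname{cm}(C_i) - \operatorname{cm}(C_{i+1})\| \le \eta$ is
equivalent to the event that $x$ lies in a fixed hyperball of radius $(|A| + |B_i|)\eta \le n\eta$. Hence, the
probability is bounded from above by $(n\eta/\sigma)^d \le (2nc\eta/\sigma)^d$.

Now assume that $|B_1| \ne |B_2| \ne |B_3|$. For $i \in \{1, 2\}$, we set
\[ r_i = \left( \frac{|A|}{|A| + |B_i|} - \frac{|A|}{|A| + |B_{i+1}|} \right)^{-1} = \frac{(|A| + |B_i|) \cdot (|A| + |B_{i+1}|)}{|A| \cdot (|B_{i+1}| - |B_i|)} \]
and
\[ Z_i = \frac{|B_{i+1}|}{|A| + |B_{i+1}|} \cdot \operatorname{cm}(B_{i+1}) - \frac{|B_i|}{|A| + |B_i|} \cdot \operatorname{cm}(B_i) . \]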

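According to (6), the event $\|\operatorname{cm}(C_i) - \operatorname{cm}(C_{i+1})\| \le \eta$ is equivalent to the event that $\operatorname{cm}(A)$
falls into the hyperball with radius $|r_i|\eta$ and center $r_i Z_i$. Hence, the event that both $\|\operatorname{cm}(C_1) - \operatorname{cm}(C_2)\| \le \eta$
and $\|\operatorname{cm}(C_2) - \operatorname{cm}(C_3)\| \le \eta$ can only occur if the hyperballs $B(r_1 Z_1, |r_1|\eta)$ and
$B(r_2 Z_2, |r_2|\eta)$ intersect. This event occurs if and only if the centers $r_1 Z_1$ and $r_2 Z_2$ have a distance
of at most $(|r_1| + |r_2|)\eta$ from each other. Hence,
\[ \Pr\left[ (\|\operatorname{cm}(C_1) - \operatorname{cm}(C_2)\| \le \eta) \wedge (\|\operatorname{cm}(C_2) - \operatorname{cm}(C_3)\| \le \eta) \right] \le \Pr\left[ \|r_1 Z_1 - r_2 Z_2\| \le (|r_1| + |r_2|)\eta \right] . \]
After some algebraic manipulations, we can write the vector $r_1 Z_1 - r_2 Z_2$ as
\begin{align*}
& - \frac{|A| + |B_2|}{|A| \cdot (|B_2| - |B_1|)} \cdot \sum_{x \in B_1} x - \frac{|A| + |B_2|}{|A| \cdot (|B_3| - |B_2|)} \cdot \sum_{x \in B_3} x \\
& + \left( \frac{|A| + |B_1|}{|A| \cdot (|B_2| - |B_1|)} + \frac{|A| + |B_3|}{|A| \cdot (|B_3| - |B_2|)} \right) \cdot \sum_{x \in B_2} x .
\end{align*}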
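The "algebraic manipulations" can be double-checked with exact arithmetic in one dimension, where only the set sizes and coordinate sums matter. The sketch below is illustrative; the function name `expansion_matches` and the sample sets are assumptions, and the sizes $|B_1|, |B_2|, |B_3|$ must satisfy $|B_1| \ne |B_2| \ne |B_3|$ as in the text.

```python
from fractions import Fraction

def expansion_matches(size_a, b1, b2, b3):
    """Compare r1*Z1 - r2*Z2 with its expanded form for 1-D point lists b1, b2, b3."""
    n1, n2, n3 = len(b1), len(b2), len(b3)
    s1, s2, s3 = sum(b1), sum(b2), sum(b3)   # |B_i| * cm(B_i) is the coordinate sum
    r1 = Fraction((size_a + n1) * (size_a + n2), size_a * (n2 - n1))
    r2 = Fraction((size_a + n2) * (size_a + n3), size_a * (n3 - n2))
    z1 = Fraction(s2, size_a + n2) - Fraction(s1, size_a + n1)
    z2 = Fraction(s3, size_a + n3) - Fraction(s2, size_a + n2)
    direct = r1 * z1 - r2 * z2
    expanded = (
        -Fraction(size_a + n2, size_a * (n2 - n1)) * s1
        - Fraction(size_a + n2, size_a * (n3 - n2)) * s3
        + (Fraction(size_a + n1, size_a * (n2 - n1))
           + Fraction(size_a + n3, size_a * (n3 - n2))) * s2
    )
    return direct == expanded

assert expansion_matches(5, [1, 2], [3], [4, 6, 8])
```

Note that $\operatorname{cm}(A)$ does not appear in the expanded form, which is what allows the subsequent argument to rely only on the randomness of a single point $x$ in $B_1 \triangle B_3$.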
Since $B_1 \ne B_3$, there must be an $x \in B_1 \triangle B_3$. We can assume that $x \in B_1 \setminus B_3$. If $x \notin B_2$,
we let an adversary choose all positions of the points in $B_1 \cup B_2 \cup B_3 \setminus \{x\}$. Then the event
$\|r_1 Z_1 - r_2 Z_2\| \le (|r_1| + |r_2|)\eta$ is equivalent to $x$ falling into a fixed hyperball of radius
\[ \left| \frac{|A| \cdot (|B_2| - |B_1|)}{|A| + |B_2|} \right| \cdot (|r_1| + |r_2|)\,\eta = \Big| |B_2| - |B_1| \Big| \cdot \left( \left| \frac{|A| + |B_1|}{|B_2| - |B_1|} \right| + \left| \frac{|A| + |B_3|}{|B_3| - |B_2|} \right| \right) \eta \le 2nc\eta . \]
The probability of this event is thus bounded from above by $(2nc\eta/\sigma)^d$.

It remains to consider the case that $x \in (B_1 \cap B_2) \setminus B_3$. Also in this case we let an adversary
choose the positions of the points in $B_1 \cup B_2 \cup B_3 \setminus \{x\}$. Now the event $\|r_1 Z_1 - r_2 Z_2\| \le (|r_1| + |r_2|)\eta$
is equivalent to $x$ falling into a fixed hyperball of radius
\[ \left| \frac{|A| \cdot (|B_3| - |B_2|)}{|A| + |B_2|} \right| \cdot (|r_1| + |r_2|)\,\eta = \Big| |B_3| - |B_2| \Big| \cdot \left( \left| \frac{|A| + |B_1|}{|B_2| - |B_1|} \right| + \left| \frac{|A| + |B_3|}{|B_3| - |B_2|} \right| \right) \eta \le 2nc\eta . \]
Hence, the probability is bounded from above by $(2nc\eta/\sigma)^d$ also in this case.

This concludes the proof because there are at most $(7n)^{2c}$ choices for $B_1$, $B_2$, and $B_3$ and, for
every choice, the probability that both $\|\operatorname{cm}(C_1) - \operatorname{cm}(C_2)\| \le \eta$ and $\|\operatorname{cm}(C_2) - \operatorname{cm}(C_3)\| \le \eta$ is at
most $(2nc\eta/\sigma)^d$. □

Combining Lemmas 4.6 and 4.7 immediately yields the following result.



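Lemma 4.8. Fix $\varepsilon \ge 0$ and a constant $z_2 \in \mathbb{N}$. Let $\Delta_4$ denote the smallest improvement made
by any sequence of three consecutive iterations that follow blueprints whose nodes all have degree
at most $z_2$. Then,
\[ \Pr[\Delta_4 \le \varepsilon] \le \varepsilon \cdot \left( \frac{O(1) \cdot n^{2(z_2 + 1)}}{\sigma^2} \right) . \]

Proof. Taking $\eta = \sqrt{\varepsilon}$, Lemmas 4.6 and 4.7 immediately give
\[ \Pr[\Delta_4 \le \varepsilon] \le (7n)^{2z_2} \cdot \left( \frac{2nz_2\sqrt{\varepsilon}}{\sigma} \right)^d . \]
Since $d \ge 2$, the lemma follows from Fact 2.1 and the fact that $z_2$ is a constant. □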



4.5 Degenerate blueprints 

Lemma 4.9. Fix $\varepsilon \in [0, 1]$. Let $\Delta_5$ denote the smallest improvement made by any iteration that
follows a degenerate blueprint. Then,

Pr[A5<£] <e-(^ j. 

Proof. Consider such an iteration. Since the blueprint is degenerate, there must exist two clusters
$C_i$ and $C_j$ that have identical approximate centers and that exchange a data point during the
iteration. Let $c_i$ and $c_j$ denote the actual centers of these clusters at the beginning of the
iteration. By Lemma 3.3, $\delta_\varepsilon \le \|c_i - c_j\| \le 2\sqrt{n\varepsilon}$. However, we know from Lemma 2.5 that this
occurs with probability at most e • (0(1) • rt"'^ /a)"^ . □ 
4.6 Other Blueprints



Now, after having ruled out five special cases, we can analyze the case of a general blueprint. 

Lemma 4.10. Fix $\varepsilon \in [0, 1]$. Let $\Delta_6$ be the smallest improvement made by any iteration whose
blueprint does not fall into any of the previous five categories with $z_1 = 8$ and $z_2 = 7$. This
means that we consider only non-degenerate blueprints whose balanced nodes have in- and outdegree
at least $8d + 1$, that do not have nodes of degree one, that have at most two disjoint pairs
of adjacent unbalanced nodes of degree 2, and that have a node with degree at least 8. Then,



Proving this lemma requires some preparation. Assume that the iteration follows a blueprint
$\mathcal{B}$ with $m$ edges and $b$ balanced nodes. We distinguish two cases: either the center of one
unbalanced cluster assumes a position that is $\sqrt{n\varepsilon}$ away from its approximate position or all
centers are at most $\sqrt{n\varepsilon}$ far away from their approximate positions. In the former case the
potential drops by at least $\varepsilon$ according to Lemma 3.3. If this is not the case, the potential
drops if one of the points is far away from its corresponding approximate bisector according to
Lemma 3.5.

The fact that the blueprint does not belong to any of the previous categories allows us to 
derive the following upper bound on its number of nodes. 

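Lemma 4.11. Let $\mathcal{B}$ denote an arbitrary transition blueprint with $m$ edges and $b$ balanced nodes
in which every node has degree at least two and every balanced node has degree at least $2dz_1 + 2$.
Furthermore, let there be at most two disjoint pairs of adjacent nodes of degree two in $\mathcal{B}$, and
assume that there is one node with degree at least $z_2 + 1 > 2$. Then the number of nodes in $\mathcal{B}$ is
bounded from above by
\[ \begin{cases} \frac{5}{6} m - \frac{z_2 - 4}{3} & \text{if } b = 0, \\[4pt] \frac{5}{6} m - \frac{(2z_1 d - 1) b - 2}{3} & \text{if } b \ge 1. \end{cases} \]
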
Proof. Let $A$ be the set of nodes of degree two, and let $B$ be the set of nodes of higher degree.
We first bound the number of edges between nodes in $A$: There are at most two disjoint pairs of
adjacent nodes of degree two. For each of these pairs, we define its extension to be the longest
path of nodes of degree two containing the pair. We know that none of these extensions can
form a cycle as the transition graph is connected and contains a node of degree $z_2 + 1 > 2$.
There are $\lfloor h/2 \rfloor$ disjoint pairs in an extension consisting of $h$ nodes. As the extensions contain
all edges between nodes of degree 2, this implies that the number of edges between vertices in
$A$ is at most four. Let $\deg(A)$ and $\deg(B)$ denote the sum of the degrees of the nodes in $A$ and
$B$, respectively. The total degree $\deg(A)$ of the vertices in $A$ is $2|A|$. Hence, there are at least
$2|A| - 8$ edges between $A$ and $B$. Therefore,
\begin{align*}
2|A| - 8 \le \deg(B) \;&\Rightarrow\; 2|A| - 8 \le 2m - 2|A| \\
&\Rightarrow\; |A| \le \frac{1}{2} m + 2 .
\end{align*}

Let $t$ denote the number of nodes. The nodes in $B$ have degree at least 3, there is one node
in $B$ with degree at least $z_2 + 1$, and balanced nodes have degree at least $2z_1 d + 2$ (and hence,
belong to $B$). Therefore, if $b = 0$,
\begin{align*}
& 2m \ge 2|A| + 3(t - |A| - 1) + z_2 + 1 \\
\Rightarrow\; & 2m + |A| \ge 3t + z_2 - 2 \\
\Rightarrow\; & \frac{5}{2} m \ge 3t + z_2 - 4 .
\end{align*}
If $b \ge 1$, then the node of degree at least $z_2 + 1$ might be balanced and we obtain
\begin{align*}
& 2m \ge 2|A| + (2z_1 d + 2) b + 3(t - |A| - b) \\
\Rightarrow\; & 2m + |A| \ge 3t + (2z_1 d - 1) b \\
\Rightarrow\; & \frac{5}{2} m \ge 3t + (2z_1 d - 1) b - 2 .
\end{align*}
The lemma follows by solving these inequalities for $t$. □
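Solving the final inequalities of the proof for $t$ is elementary; the following Python sketch computes the resulting bound and compares it against a brute-force search over the two inequalities $|A| \le m/2 + 2$ and $2m + |A| \ge 3t + \ldots$ taken from the proof. The function names `node_bound` and `max_feasible_t` are assumptions introduced for this illustration.

```python
from fractions import Fraction

def node_bound(m, b, z1, z2, d):
    """Upper bound on the number of nodes t obtained by solving the inequalities for t."""
    if b == 0:
        return Fraction(5 * m, 6) - Fraction(z2 - 4, 3)
    return Fraction(5 * m, 6) - Fraction((2 * z1 * d - 1) * b - 2, 3)

def max_feasible_t(m, b, z1, z2, d):
    """Largest integer t compatible with the inequalities from the proof (brute force)."""
    best = 0
    for size_a in range(0, m // 2 + 3):
        if 2 * size_a > m + 4:            # enforce |A| <= m/2 + 2
            continue
        offset = z2 - 2 if b == 0 else (2 * z1 * d - 1) * b
        slack = 2 * m + size_a - offset   # 3t <= 2m + |A| - offset
        best = max(best, slack // 3)
    return best

for m, b in [(12, 0), (30, 0), (60, 1), (200, 3)]:
    assert max_feasible_t(m, b, 8, 7, 2) <= node_bound(m, b, 8, 7, 2)
```

The parameters $z_1 = 8$, $z_2 = 7$ match the values fixed in Lemma 4.10; the dimension $d = 2$ is an arbitrary choice for the check.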



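We can now continue to bound $\Pr[\Lambda(\mathcal{B}) \le \lambda]$ for a fixed blueprint $\mathcal{B}$. The previous lemma
implies that a relatively large number of points must switch clusters, and each such point is
positioned independently according to a normal distribution. Unfortunately, the approximate
bisectors are not independent of these point locations, which adds a technical challenge. We
resolve this difficulty by changing variables and then bounding the effect of this change.

Lemma 4.12. For a fixed transition blueprint $\mathcal{B}$ with $m$ edges and $b$ balanced clusters that does
not belong to any of the previous five categories and for any $\lambda \ge 0$, we have
\[ \Pr[\Lambda(\mathcal{B}) \le \lambda] \le \begin{cases} \left( \dfrac{\sqrt{d}\, m^2 \lambda}{\sigma} \right)^{\frac{m}{6} + \frac{z_2 - 1}{3}} & \text{if } b = 0, \\[8pt] \left( \dfrac{\sqrt{d}\, m^2 \lambda}{\sigma} \right)^{\frac{m}{6} + \frac{(2z_1 d + 2) b - 2}{3}} & \text{if } b \ge 1. \end{cases} \]
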
Proof. We partition the set of edges in the transition graph into reference edges and test edges. 
For this, we ignore the directions of the edges in the transition graph and compute a spanning 
tree in the resulting undirected multi-graph. We let an arbitrary balanced cluster be the root 
of this spanning tree. If all clusters are unbalanced, then an arbitrary cluster is chosen as the 
root. We mark every edge whose child is an unbalanced cluster as a reference edge. In this way, 
every unbalanced cluster $C_i$ can be incident to several reference edges. But we will refer only to
the reference edge between $C_i$'s parent and $C_i$ as the reference edge associated with $C_i$. Possibly
except for the root, every unbalanced cluster is associated with exactly one reference edge.
Observe that in the transition graph, the reference edge of an unbalanced cluster Ci can either 
be directed from Ci to its parent or vice versa, as we ignored the directions of the edges when 
we computed the spanning tree. From now on, we will again take into account the directions of 
the edges. 
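The partition into reference and test edges can be sketched in Python as follows. This is an illustration only, not the paper's procedure verbatim: the function name `split_edges`, the use of breadth-first search to obtain a spanning tree, the choice of the smallest balanced cluster as root, and the example graph are all assumptions. The transition graph is assumed to be connected.

```python
from collections import deque

def split_edges(num_clusters, edges):
    """edges: list of directed (source, target) cluster pairs, one per switching point."""
    in_deg = [0] * num_clusters
    out_deg = [0] * num_clusters
    adjacency = [[] for _ in range(num_clusters)]
    for idx, (src, dst) in enumerate(edges):
        out_deg[src] += 1
        in_deg[dst] += 1
        adjacency[src].append((dst, idx))   # directions are ignored for the tree
        adjacency[dst].append((src, idx))
    balanced = {c for c in range(num_clusters) if in_deg[c] == out_deg[c]}
    root = min(balanced) if balanced else 0  # a balanced root if one exists
    reference = {}                            # unbalanced child -> its reference edge
    visited = {root}
    queue = deque([root])
    while queue:
        parent = queue.popleft()
        for child, idx in adjacency[parent]:
            if child in visited:
                continue
            visited.add(child)
            queue.append(child)
            if child not in balanced:         # tree edge whose child is unbalanced
                reference[child] = idx
    used = set(reference.values())
    test = [idx for idx in range(len(edges)) if idx not in used]
    return root, reference, test

root, reference, test = split_edges(4, [(0, 1), (1, 2), (0, 2), (2, 3), (2, 3)])
print(root, reference, test)
```

In the example, clusters 1 and 2 are balanced, so the number of reference edges equals the number of nodes minus the number of balanced nodes, matching the count used later in the proof.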

For every unbalanced cluster $i$ with an associated reference edge, we define the point $q_i$ as
\[ q_i = \sum_{x \in A_i} x - \sum_{x \in B_i} x , \tag{7} \]
where $A_i$ and $B_i$ denote the sets of incoming and outgoing edges of $C_i$, respectively. The intuition
behind this definition is as follows: as we consider a fixed blueprint $\mathcal{B}$, once $q_i$ is fixed also the
approximate center of cluster $i$ is fixed. Let $q$ denote the point defined as in (7) but for the root
instead of cluster $i$. If all clusters are unbalanced and $q_i$ is fixed for every cluster except for the
root, then also the value of $q$ is implicitly fixed as $q + \sum_i q_i = 0$. Hence, once each $q_i$ is fixed,
the approximate center of every unbalanced cluster is also fixed.

Relabeling as necessary, we assume without loss of generality that the clusters with an associated
reference edge are the clusters $C_1, \ldots, C_r$ and that the corresponding reference edges
correspond to the points $p_1, \ldots, p_r$. Furthermore, we can assume that the clusters are topologically
sorted: if $C_i$ is a descendant of $C_j$, then $i < j$.

Let us now assume that an adversary chooses an arbitrary position for $q_i$ for every cluster
$C_i$ with $i \in [r]$. Intuitively, we will show that regardless of how the transition blueprint $\mathcal{B}$ is
chosen and regardless of how the adversary fixes the positions of the $q_i$, there is still enough
randomness left to conclude that it is unlikely that all points involved in the iteration are close
to their corresponding approximate bisectors. We can alternatively view this as follows: Our
random experiment is to choose the $md$-dimensional Gaussian vector $p = (p_1, \ldots, p_m)$, where
$p_1, \ldots, p_m \in \mathbb{R}^d$ are the points that correspond to the edges in the blueprint. For each $i \in [r]$
and $j \in [d]$ let $b_{ij} \in \{-1, 0, 1\}^{md}$ be the vector so that the $j$-th component of $q_i$ can be written
as $p \cdot b_{ij}$. Then allowing the adversary to fix the positions of the $q_i$ is equivalent to letting him
fix the value of every dot product $p \cdot b_{ij}$.

Figure 1: Solid and dashed edges indicate reference and test edges, respectively. When computing
the spanning tree, the directions of the edges are ignored. Hence, reference edges
can either be directed from parent to child or vice versa. In this example, the spanning
tree consists of the edges $p_3$, $p_7$, $p_1$, and $p_2$, and its root is $C_4$. We denote by $I_d$ the
$d \times d$ identity matrix and by $O_d$ the $d \times d$ zero matrix. The first three columns of $M$
correspond to $q_1$, $q_2$, and $q_3$. The rows correspond to the points $p_1, \ldots, p_7$. Each block
matrix $B_i$ corresponds to an orthonormal basis of $\mathbb{R}^d$ and is therefore orthogonal.

After the positions of the $q_i$ are chosen, we know the location of the approximate center of
every unbalanced cluster. Additionally, the blueprint provides an approximate center for every
balanced cluster. Hence, we know the positions of all approximate bisectors. We would like to
estimate the probability that all points $p_{r+1}, \ldots, p_m$ have a distance of at most $\lambda$ from their
corresponding approximate bisectors. For this, we further reduce the randomness and project
each point $p_i$ with $i \in \{r + 1, \ldots, m\}$ onto the normal vector of its corresponding approximate
bisector. Formally, for each $i \in \{r + 1, \ldots, m\}$, let $h_i$ denote a normal vector to the approximate
bisector corresponding to $p_i$, and let $b_{i1} \in [-1, 1]^{md}$ denote the vector such that $p \cdot b_{i1} = p_i \cdot h_i$.
This means that $p_i$ is at a distance of at most $\lambda$ from its approximate bisector if and only if $p \cdot b_{i1}$
lies in some fixed interval $I_i$ of length $2\lambda$. As this event is independent of the other points $p_j$
with $j \ne i$, the vector $b_{i1}$ is a unit vector in the subspace spanned by the vectors $e_{(i-1)d+1}, \ldots, e_{id}$
from the canonical basis. Let $B_i = \{b_{i1}, \ldots, b_{id}\}$ be an orthonormal basis of this subspace. Let
$M$ denote the $(md) \times (md)$ matrix whose columns are the vectors $b_{11}, \ldots, b_{1d}, \ldots, b_{m1}, \ldots, b_{md}$.
Figure 1 illustrates these definitions.

For $i \in [r]$ and $j \in [d]$, the values of $p \cdot b_{ij}$ are fixed by an adversary. Additionally, we allow
the adversary to fix the values of $p \cdot b_{ij}$ for $i \in \{r + 1, \ldots, m\}$ and $j \in \{2, \ldots, d\}$. All this
together defines an $(m - r)$-dimensional affine subspace $U$ of $\mathbb{R}^{md}$. We stress that the subspace
$U$ is chosen by the adversary and no assumptions about $U$ are made. In the following, we will
condition on the event that $p = (p_1, \ldots, p_m)$ lies in this subspace. We denote by $\mathcal{F}$ the event that
$p \cdot b_{i1} \in I_i$ for all $i \in \{r + 1, \ldots, m\}$. Conditioned on the event that the random vector $p$ lies in
the subspace $U$, $p$ follows an $(m - r)$-dimensional Gaussian distribution with standard deviation
$\sigma$. However, we cannot directly estimate the probability of the event $\mathcal{F}$ as the projections of the
vectors $b_{i1}$ onto the affine subspace $U$ might not be orthogonal. To estimate the probability of
$\mathcal{F}$, we perform a change of variables. Let $a_1, \ldots, a_{m-r}$ be an arbitrary orthonormal basis of the
$(m - r)$-dimensional subspace obtained by shifting $U$ so that it contains the origin. Assume for
the moment that we had, for each of these vectors $a_\ell$, an interval $I'_\ell$ such that $\mathcal{F}$ can only occur
if $p \cdot a_\ell \in I'_\ell$ for every $\ell$. Then we could bound the probability of $\mathcal{F}$ from above by $\prod_\ell \frac{|I'_\ell|}{\sqrt{2\pi}\sigma}$ as the
$p \cdot a_\ell$ can be treated as independent one-dimensional Gaussian random variables with standard
deviation $\sigma$ after conditioning on $U$. In the following, we construct such intervals $I'_\ell$.

It is important that the vectors $b_{ij}$ for $i \in [m]$ and $j \in [d]$ form a basis of $\mathbb{R}^{md}$. To see this, let
us first have a closer look at the matrix $M \in \mathbb{R}^{md \times md}$ viewed as an $m \times m$ block matrix with
blocks of size $d \times d$. From the fact that the reference points are topologically sorted it follows
that the upper left part, which consists of the first $dr$ rows and columns, is an upper triangular
matrix with non-zero diagonal entries.

As the upper right $(dr) \times d(m - r)$ sub-matrix of $M$ consists solely of zeros, the determinant of
$M$ is the product of the determinant of the upper left $(dr) \times (dr)$ sub-matrix and the determinant
of the lower right $d(m - r) \times d(m - r)$ sub-matrix. Both of these determinants can easily be
seen to be different from zero. Hence, also the determinant of $M$ is not equal to zero, which in
turn implies that the vectors $b_{ij}$ are linearly independent and form a basis of $\mathbb{R}^{md}$.
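The determinant argument relies on the standard fact that a block matrix with a zero upper right block has determinant equal to the product of the determinants of its diagonal blocks. The following Python sketch checks this on a small example with exact arithmetic; the helper `det` (plain Gaussian elimination) and the sample blocks are assumptions introduced for illustration and are unrelated to the specific matrix $M$.

```python
from fractions import Fraction

def det(matrix):
    """Determinant via Gaussian elimination over the rationals."""
    a = [[Fraction(x) for x in row] for row in matrix]
    n = len(a)
    result = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if a[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result              # a row swap flips the sign
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return result

upper_left = [[1, 2], [0, 3]]     # upper triangular with non-zero diagonal
lower_right = [[0, 1], [-1, 0]]   # an orthogonal block
lower_left = [[5, -4], [7, 9]]    # arbitrary entries below the zero block
full = [upper_left[0] + [0, 0], upper_left[1] + [0, 0],
        lower_left[0] + lower_right[0], lower_left[1] + lower_right[1]]

assert det(full) == det(upper_left) * det(lower_right)
```

The lower left block does not influence the determinant, which mirrors why only the two diagonal parts of $M$ need to be inspected.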

In particular, we can write every $a_\ell$ as a linear combination of the vectors $b_{ij}$. Let
\[ a_\ell = \sum_{i,j} c^\ell_{ij} b_{ij} \]
for some coefficients $c^\ell_{ij} \in \mathbb{R}$. Since the values of $p \cdot b_{ij}$ are fixed for $i \in [r]$ and $j \in [d]$ as well as
for $i \in \{r + 1, \ldots, m\}$ and $j \in \{2, \ldots, d\}$, we can write
\[ p \cdot a_\ell = \kappa_\ell + \sum_{i=r+1}^{m} c^\ell_{i1} (p \cdot b_{i1}) \]
for some constant $\kappa_\ell$ that depends on the fixed values chosen by the adversary. Let $c_{\max} = \max\{|c^\ell_{i1}| \mid i > r\}$.
The event $\mathcal{F}$ happens only if, for every $i > r$, the value of $p \cdot b_{i1}$ lies in some
fixed interval of length $2\lambda$. Thus, we conclude that $\mathcal{F}$ can happen only if for every $\ell \in [m - r]$
the value of $p \cdot a_\ell$ lies in some fixed interval $I'_\ell$ of length at most $2c_{\max}(m - r)\lambda$. It only remains
to bound $c_{\max}$ from above. For $\ell \in [m - r]$, the vector $c^\ell$ of the coefficients $c^\ell_{ij}$ is obtained
as the solution of the linear system $M c^\ell = a_\ell$. The fact that the upper right $(dr) \times d(m - r)$
sub-matrix of $M$ consists only of zeros implies that the first $dr$ entries of $a_\ell$ uniquely determine
the first $dr$ entries of the vector $c^\ell$. As $a_\ell$ is a unit vector, the absolute values of all its entries are
bounded by 1. Now we observe that each row of the matrix $M$ contains at most two non-zero
entries in the first $dr$ columns because every edge in the transition blueprint belongs to only
two clusters. This and a short calculation shows that the absolute values of the first $dr$ entries
of $c^\ell$ are bounded by $r$: The absolute values of the entries $d(r - 1) + 1, \ldots, dr$ coincide with the
absolute values of the corresponding entries in $a_\ell$ and are thus bounded by 1. Given this, the
rows $d(r - 2) + 1, \ldots, d(r - 1)$ imply that the corresponding values in $c^\ell$ are bounded by 2 and
so on.

Assume that the first $dr$ coefficients of $c^\ell$ are fixed to values whose absolute values are bounded
by $r$. This leaves us with a system $M'(c^\ell)' = a'_\ell$, where $M'$ is the lower right $((m - r)d) \times ((m - r)d)$
sub-matrix of $M$, $(c^\ell)'$ are the remaining $(m - r)d$ entries of $c^\ell$, and $a'_\ell$ is a vector obtained
from $a_\ell$ by taking into account the first $dr$ fixed values of $c^\ell$. All absolute values of the entries
of $a'_\ell$ are bounded by $2r + 1$. As $M'$ is a diagonal block matrix, we can decompose this into
$m - r$ systems with $d$ variables and equations each. As every $d \times d$-block on the diagonal of the
matrix $M'$ is an orthonormal basis of the corresponding $d$-dimensional subspace, the matrices
in the subsystems are orthonormal. Furthermore, the right-hand sides have a norm of at most
$(2r + 1)\sqrt{d}$. Hence, we can conclude that $c_{\max}$ is bounded from above by $3\sqrt{d}\, r$.
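Thus, the probability of the event $\mathcal{F}$ can be bounded from above by
\[ \prod_{\ell=1}^{m-r} \frac{|I'_\ell|}{\sqrt{2\pi}\sigma} \le \left( \frac{6\sqrt{d}\, r (m - r) \lambda}{\sqrt{2\pi}\sigma} \right)^{m-r} \le \left( \frac{\sqrt{d}\, m^2 \lambda}{\sigma} \right)^{m-r} , \]
where we used that $r(m - r) \le m^2/4$. Using Fact 2.1, we can replace the exponent $m - r$
by a lower bound. If all nodes are unbalanced, then $r$ equals the number of nodes minus one.
Otherwise, if $b \ge 1$, then $r$ equals the number of nodes minus $b$. Hence, Lemma 4.11 yields
\[ \Pr[\Lambda(\mathcal{B}) \le \lambda] \le \Pr[\mathcal{F}] \le \begin{cases} \left( \dfrac{\sqrt{d}\, m^2 \lambda}{\sigma} \right)^{\frac{m}{6} + \frac{z_2 - 1}{3}} & \text{if } b = 0, \\[8pt] \left( \dfrac{\sqrt{d}\, m^2 \lambda}{\sigma} \right)^{\frac{m}{6} + \frac{(2z_1 d + 2) b - 2}{3}} & \text{if } b \ge 1, \end{cases} \]
which completes the proof. □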



With the previous lemma, we can bound the probability that there exists an iteration whose 
transition blueprint does not fall into any of the previous categories and that makes a small 
improvement. 

Proof of Lemma 4.10. Let $\mathbb{B}$ denote the set of $(m, b, \varepsilon)$-blueprints that do not fall into the previous
five categories. Here, $\varepsilon$ is fixed but there are $nk$ possible choices for $m$ and $b$. As in the proof
of Lemma 4.3, we will use a union bound to estimate the probability that there exists a blueprint
$\mathcal{B} \in \mathbb{B}$ with $\Lambda(\mathcal{B}) \le \lambda$. Note that once $m$ and $b$ are fixed, there are at most $(nk^2)^m$ possible choices
for the edges in a blueprint, and for every balanced cluster, there are at most $\left( \frac{D\sqrt{d}}{\sqrt{n\varepsilon}} \right)^d$ choices
for its approximate center. Also, in all cases, $m \ge \max(z_2 + 1, b(dz_1 + 1)) = \max(8, 8bd + b)$,
because there is always one vertex with degree at least $z_2 + 1$, and there are always $b$ vertices
with degree at least $2dz_1 + 2$.

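Now we set $Y = k^5 \cdot \sqrt{ndD}$. Lemma 4.12 yields the following bound, where the sums range over 
the admissible choices of $m$ and $b$: 

$$
\begin{aligned}
\Pr\left[\exists B \in \mathcal{B} : \Lambda(B) \le \frac{6D\sqrt{nd}}{Y} \cdot \varepsilon^{1/3}\right]
\le{} & \sum_{m} (nk^2)^m \left(\frac{\sqrt{d}\,m^2}{\sigma} \cdot \frac{6D\sqrt{nd}}{Y} \cdot \varepsilon^{1/3}\right)^{\frac{m}{6} + \frac{z_2 - 1}{3}} \\
& + \sum_{b \ge 1} \sum_{m} \left(\frac{D\sqrt{d}}{\sqrt{n\varepsilon}}\right)^{bd} (nk^2)^m \left(\frac{\sqrt{d}\,m^2}{\sigma} \cdot \frac{6D\sqrt{nd}}{Y} \cdot \varepsilon^{1/3}\right)^{\frac{m}{6} + \frac{(2z_1 d + 2)b - 2}{3}}.
\end{aligned}
\tag{8}
$$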



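Each term in the first sum simplifies as follows, using $m \le n$ and $Y = k^5 \cdot \sqrt{ndD}$: 

$$
(nk^2)^m \left(\frac{\sqrt{d}\,m^2}{\sigma} \cdot \frac{6D\sqrt{nd}}{Y} \cdot \varepsilon^{1/3}\right)^{\frac{m}{6} + \frac{z_2 - 1}{3}}
\le \left(\frac{6n^8k^7d^{1/2}D^{1/2}}{\sigma} \cdot \varepsilon^{1/3}\right)^{\frac{m}{6} + \frac{z_2 - 1}{3}}.
$$

Furthermore, $\frac{m}{6} + \frac{z_2 - 1}{3} \ge \frac{8}{6} + 2 > 3$, so we can use Fact 2.1 to decrease the exponent here, which 
gives us 

$$
(nk^2)^m \left(\frac{\sqrt{d}\,m^2}{\sigma} \cdot \frac{6D\sqrt{nd}}{Y} \cdot \varepsilon^{1/3}\right)^{\frac{m}{6} + \frac{z_2 - 1}{3}}
\le \left(\frac{6n^8k^7d^{1/2}D^{1/2}}{\sigma} \cdot \varepsilon^{1/3}\right)^{3}
= \varepsilon \cdot \frac{O(1) \cdot n^{24}k^{21}d^{3/2}D^{3/2}}{\sigma^3}.
$$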
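Similarly, each term in the second sum simplifies as follows: 

$$
\left(\frac{D\sqrt{d}}{\sqrt{n\varepsilon}}\right)^{bd} (nk^2)^m \left(\frac{\sqrt{d}\,m^2}{\sigma} \cdot \frac{6D\sqrt{nd}}{Y} \cdot \varepsilon^{1/3}\right)^{\frac{m}{6} + \frac{(2z_1 d + 2)b - 2}{3}}
\le \left(\frac{D\sqrt{d}}{\sqrt{n\varepsilon}}\right)^{bd} \left(\frac{6n^8k^7d^{1/2}D^{1/2}}{\sigma} \cdot \varepsilon^{1/3}\right)^{\frac{m}{6} + \frac{(2z_1 d + 2)b - 2}{3}}.
$$

Furthermore, 

$$
\frac{m}{6} + \frac{(2z_1 d + 2)b - 2}{3} \ge \frac{8bd + b}{6} + \frac{16bd + 2b - 2}{3} \ge \frac{20bd}{3}.
$$

Therefore, we can further bound this quantity by 

$$
\left(\left(\frac{D\sqrt{d}}{\sqrt{n\varepsilon}}\right)^{3/20} \cdot \frac{6n^8k^7d^{1/2}D^{1/2}}{\sigma} \cdot \varepsilon^{1/3}\right)^{\frac{m}{6} + \frac{(2z_1 d + 2)b - 2}{3}}.
$$

As noted above, 

$$
\frac{m}{6} + \frac{(2z_1 d + 2)b - 2}{3} \ge \frac{20bd}{3} \ge \frac{120}{31},
$$

so we can use Fact 2.1 to decrease the exponent, which gives us 

$$
\left(\left(\frac{D\sqrt{d}}{\sqrt{n\varepsilon}}\right)^{3/20} \cdot \frac{6n^8k^7d^{1/2}D^{1/2}}{\sigma} \cdot \varepsilon^{1/3}\right)^{120/31}
\le \varepsilon \cdot \frac{O(1) \cdot n^{317/10}k^{28}d^{23/10}D^{13/5}}{\sigma^4}.
$$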
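Using these bounds, we can simplify equation (8): 

$$
\Pr\left[\exists B \in \mathcal{B} : \Lambda(B) \le \frac{6D\sqrt{nd}}{Y} \cdot \varepsilon^{1/3}\right]
\le \varepsilon \cdot \frac{O(1) \cdot n^{327/10}k^{29}d^{23/10}D^{13/5}}{\sigma^4}.
$$

On the other hand, $Y = k^5 \cdot \sqrt{ndD} \ge 1$, so Lemma 2.5 guarantees 

$$
\Pr\left[\delta_\varepsilon \le Y\varepsilon^{1/3}\right] \le \varepsilon \cdot \frac{O(1) \cdot n^{33}k^{30}d^{3}D^{3}}{\sigma^6}.
$$
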
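Finally, we know from Lemma 3.5 that if a blueprint $B$ can result in a potential drop of at most 
$\varepsilon$, then $\delta_\varepsilon \cdot \Lambda(B) \le 6D\sqrt{nd}\,\varepsilon$. We must therefore have either $\delta_\varepsilon \le Y\varepsilon^{1/3}$ or 
$\Lambda(B) \le \frac{6D\sqrt{nd}}{Y} \cdot \varepsilon^{1/3}$, since otherwise the product would exceed $6D\sqrt{nd}\,\varepsilon^{2/3} \ge 6D\sqrt{nd}\,\varepsilon$. 

Therefore, 

$$
\begin{aligned}
\Pr[\Delta_6 \le \varepsilon]
&\le \Pr\left[\exists B \in \mathcal{B} : \Lambda(B) \le \frac{6D\sqrt{nd}}{Y} \cdot \varepsilon^{1/3}\right] + \Pr\left[\delta_\varepsilon \le Y\varepsilon^{1/3}\right] \\
&\le \varepsilon \cdot \frac{O(1) \cdot n^{327/10}k^{29}d^{23/10}D^{13/5}}{\sigma^4} + \varepsilon \cdot \frac{O(1) \cdot n^{33}k^{30}d^{3}D^{3}}{\sigma^6} \\
&\le \varepsilon \cdot \frac{O(1) \cdot n^{33}k^{30}d^{3}D^{3}}{\sigma^6},
\end{aligned}
$$

which concludes the proof. 

□ 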
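
The exponent arithmetic used in this proof can be checked mechanically. The following Python sketch is an illustration only and not part of the original argument: it assumes $z_1 = 8$ (read off from the identity $b(dz_1 + 1) = 8bd + b$), the helper names are hypothetical, and it only tests a finite range of parameters rather than proving the inequality. It verifies, with exact rationals, that $\frac{m}{6} + \frac{(2z_1 d + 2)b - 2}{3} \ge \frac{20bd}{3} \ge \frac{120}{31}$ whenever $m \ge b(dz_1 + 1)$ and $b \ge 1$.

```python
from fractions import Fraction

Z1 = 8  # assumed value of z_1, inferred from b(d*z_1 + 1) = 8bd + b


def balanced_exponent(m, b, d):
    """Exponent m/6 + ((2*z_1*d + 2)*b - 2)/3 as an exact rational."""
    return Fraction(m, 6) + Fraction((2 * Z1 * d + 2) * b - 2, 3)


def min_edges(b, d):
    """Smallest m allowed by the constraint m >= b(d*z_1 + 1)."""
    return b * (d * Z1 + 1)


def check_exponent_bounds(max_b=6, max_d=6, extra_m=5):
    """Test the two lower bounds on a finite grid of (b, d, m)."""
    for b in range(1, max_b + 1):
        for d in range(1, max_d + 1):
            start = min_edges(b, d)
            for m in range(start, start + extra_m):
                e = balanced_exponent(m, b, d)
                assert e >= Fraction(20 * b * d, 3)
                assert e >= Fraction(120, 31)
    return True


if __name__ == "__main__":
    print(check_exponent_bounds())
```

Exact rationals are used instead of floats so that the comparison is not affected by rounding.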



4.7 The Main Theorem 

Given the analysis of the different types of iterations, we can complete the proof that k-means 
has polynomial smoothed running time. 

Proof of the main theorem. Let $T$ denote the maximum number of iterations that k-means can need 
on the perturbed data set $X$, and let $\Delta$ denote the minimum possible potential drop over a 
period of three consecutive iterations. As remarked in Section 2, we can assume that all the 
data points lie in the hypercube $[-D/2, D/2]^d$ for $D = \sqrt{90kd \cdot \ln(n)}$, because the alternative 
contributes only an additive term of $+1$ to $E[T]$. 

After the first iteration, we know that the potential is at most $ndD^2$. This implies that if $T \ge 3t + 1$, 
then $\Delta \le ndD^2/t$. However, in the previous section, we proved that for $\varepsilon \in (0, 1]$, 



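$$
\Pr[\Delta \le \varepsilon] \le \sum_{i=1}^{6} \Pr[\Delta_i \le \varepsilon] \le \varepsilon \cdot \frac{O(1) \cdot n^{33}k^{30}d^{3}D^{3}}{\sigma^6}.
$$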



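Recall from Section 2 that $T \le n^{3kd}$ regardless of the perturbation. Therefore, 

$$
\begin{aligned}
E[T] &\le O(ndD^2) + \sum_{t = ndD^2}^{n^{3kd}} 3 \cdot \Pr[T \ge 3t + 1] \\
&\le O(ndD^2) + \sum_{t = ndD^2}^{n^{3kd}} 3 \cdot \Pr\left[\Delta \le \frac{ndD^2}{t}\right] \\
&\le O(ndD^2) + \sum_{t = ndD^2}^{n^{3kd}} \frac{3ndD^2}{t} \cdot \frac{O(1) \cdot n^{33}k^{30}d^{3}D^{3}}{\sigma^6} \\
&= O\left(\frac{n^{34}k^{34}d^{8}\ln^{4}(n)}{\sigma^6}\right),
\end{aligned}
$$

where the last step uses $\sum_{t \le n^{3kd}} 1/t = O(kd \ln n)$ and $D^2 = 90kd \cdot \ln(n)$, 
which completes the proof. 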



□ 
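
To make the quantities in this proof concrete, here is a minimal Python sketch of the k-means method on a Gaussian-perturbed input. It is an illustration under stated assumptions, not the authors' experimental setup: the function names `perturb` and `lloyd_iterations`, the choice of initial centers, and the tie-breaking rule are all hypothetical. It counts the iterations of one run and records the potential after each assignment step, which is non-increasing from one iteration to the next.

```python
import random


def sqdist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def perturb(points, sigma, rng):
    """Add independent N(0, sigma^2) noise to every coordinate."""
    return [tuple(x + rng.gauss(0.0, sigma) for x in p) for p in points]


def lloyd_iterations(points, initial_centers, max_iter=10_000):
    """Run k-means; return (number of iterations, potentials per iteration)."""
    centers = list(initial_centers)
    k = len(centers)
    assignment = None
    potentials = []
    for it in range(1, max_iter + 1):
        # Assignment step: each point goes to its closest center
        # (ties broken by the lowest center index).
        new_assignment = [
            min(range(k), key=lambda j: sqdist(p, centers[j])) for p in points
        ]
        potentials.append(
            sum(sqdist(p, centers[a]) for p, a in zip(points, new_assignment))
        )
        if new_assignment == assignment:
            return it, potentials  # clustering has stabilized
        assignment = new_assignment
        # Update step: move each center to the mean of its cluster.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its old center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return max_iter, potentials


if __name__ == "__main__":
    rng = random.Random(0)
    base = [(rng.random(), rng.random()) for _ in range(200)]  # X' in [0, 1]^2
    data = perturb(base, sigma=0.05, rng=rng)                  # perturbed set X
    iterations, potentials = lloyd_iterations(data, data[:5])
    print(iterations, potentials[0], potentials[-1])
```

A single run like this only observes one trajectory; the quantities $T$ and $\Delta$ in the proof are defined over all possible runs on the perturbed data set.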
5 Concluding Remarks 



In this paper, we settled the smoothed running time of the k-means method for d ≥ 2. For 
d = 1, it was already known that k-means has polynomial smoothed running time [21]. 

The exponents in our smoothed analysis are constant but large. We did not make a huge 
effort to optimize the exponents as the arguments are intricate enough even without trying 
to optimize constants. Furthermore, we believe that our approach, which is essentially based 
on bounding the smallest possible improvement in a single step, is too pessimistic to yield a 
bound that matches experimental observations. A similar phenomenon occurred already in the 
smoothed analysis of the 2-opt heuristic for the TSP [13]. There it was possible to improve 
the bound for the number of iterations by analyzing sequences of consecutive steps rather than 
single steps. It is an interesting question whether this approach also leads to an improved smoothed 
analysis of k-means. 

Squared Euclidean distances, while most natural, are not the only distance measure used 
for k-means clustering. The k-means method can be generalized to arbitrary Bregman divergences [7]. 
Bregman divergences include the Kullback-Leibler divergence, which is used, e.g., in 
text classification, or Mahalanobis distances. Due to its role in applications, k-means clustering 
with Bregman divergences has attracted a lot of attention recently [1,2]. Since little is 
known about the performance of the k-means method for Bregman divergences, we raise the 
question of how the k-means method performs for Bregman divergences in the worst and smoothed 
case. 
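
For readers unfamiliar with Bregman divergences, the following Python sketch shows the standard definition $D_\varphi(x, y) = \varphi(x) - \varphi(y) - \langle \nabla\varphi(y), x - y \rangle$ and two instances mentioned above. It is an illustration only; the function names are hypothetical, and the generating functions are the textbook choices (squared norm for squared Euclidean distance, negative entropy for the Kullback-Leibler divergence on probability vectors with positive entries).

```python
import math


def bregman(phi, grad_phi, x, y):
    """Bregman divergence D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    inner = sum(g * (xi - yi) for g, xi, yi in zip(grad_phi(y), x, y))
    return phi(x) - phi(y) - inner


def sq_norm(x):
    """phi(x) = ||x||^2, which generates the squared Euclidean distance."""
    return sum(xi * xi for xi in x)


def sq_norm_grad(x):
    return [2.0 * xi for xi in x]


def neg_entropy(x):
    """phi(x) = sum_i x_i log x_i; generates KL on probability vectors."""
    return sum(xi * math.log(xi) for xi in x)


def neg_entropy_grad(x):
    return [math.log(xi) + 1.0 for xi in x]


if __name__ == "__main__":
    x, y = (0.2, 0.8), (0.5, 0.5)
    print(bregman(sq_norm, sq_norm_grad, x, y))          # squared Euclidean
    print(bregman(neg_entropy, neg_entropy_grad, x, y))  # Kullback-Leibler
```

Note that a Bregman divergence is in general not symmetric, so the order of the arguments matters when it replaces the squared Euclidean distance in the assignment step.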